Abstract (continued)


Existing  target  acquisition  models  tend  to  base  performance  on  (a)  one-dimensional  (1-D)  metrics  defining  the 
amount  of  information  in  the  target  (e.g.,  resolvable  bar  cycles,  contrast,  area,  size,  perimeter,  speed  of  motion)  and 
how  that  information  correlates  to  level  of  performance  in  a  target  acquisition  task  (i.e.,  detection,  classification, 
recognition,  and  identification),  (b)  search  processes  that  are  unrealistic  (e.g.,  that  assume  random  eye  movements), 
and  (c)  1-D  metrics  to  define  the  whole  scene  (clutter)  or  regions  of  the  scene  (e.g.,  clutter,  conspicuity, 
attractiveness).  These  tendencies  fail  to  account  for  known  human  behavior,  although  models  incorporating  them 
may  be  insensitive  to  the  details  of  human  performance  because  they  predict  ensemble  rather  than  individual 
performance. 

Phenomena  from  perceptual  psychology  known  to  affect  target  acquisition  are  reviewed  in  terms  of  how  target 
acquisition  models  do  and  do  not  account  for  them.  Such  factors  include  motion,  color,  and  visual  transients.  Basic 
models  of  visual  search  are  included  as  guides  for  how  target  acquisition  models  may  incorporate  some  of  these 
factors. 

Visual  selective  attention  is  recommended  as  a  means  for  the  theoretically  meaningful  inclusion  of  psychologically 
important  factors  into  target  acquisition  modeling. 
1. Purpose, Objectives, and Scope


This technical report is part of a technology program annex (TPA) with the U.S. Army Materiel
Systems Analysis Activity (AMSAA) that defines its purpose and objective and outlines particular
topics of interest as follows.

1.1  Purpose 

This TPA defines the proposed responsibility of the U.S. Army Research Laboratory’s (ARL)
Human Research Engineering Directorate in support of the AMSAA to perform human response-based
activities that will provide improved search and target acquisition analysis tools, techniques,
and methodologies.

1.2  Objectives 

ARL proposes to establish a methodology development program that emphasizes the description
and definition of the human processes of search and acquisition of military targets in realistic
backgrounds and the relationship between them.

1.3  Scope 

1.3.1  Topics  of  Interest 

This  review  will  survey  relevant  research  in  target  acquisition  and  highlight  the  state  of  the  art  in 
modeling particular aspects of performance including those of (a) the target: target type, number,
signature  variation,  cues  (e.g.,  glint,  muzzle  flash),  and  representation,  (b)  the  target-acquisition 
environment:  effects  of  background  and  foreground,  local  and  global  environmental  variation,  type 
of  environment  (e.g.,  tropical,  jungle,  desert),  day  versus  night  viewing,  and  clutter,  (c)  sensor 
parameters:  field  of  view  (FOV),  resolution,  and  stereoscopic  versus  non-stereoscopic,  and  (d)  type 
of  search:  FOV,  field  of  regard  (FOR),  time  required  to  search,  detect,  recognize,  and  identify 
targets. Additional topics of particular interest are as follows:

Particular  attention  will  be  paid  to  the  Johnson  criteria,  and  to  the  ACQUIRE  and  Night  Vision 
and  Electronic  Sensors  Directorate  (NVESD1)  models  since  they  or  portions  of  them  are  used  by 
AMSAA  in  current  simulation  efforts  (e.g.,  Mazz,  1998).  These  models  also  serve  as  the  basis  for 
ongoing attempts to integrate additional scene and observer parameters such as motion (e.g.,
Meitzler, Kistner et al., 1998), multiple observers (Rotman, 1989), scene obscurants (Rotman,
Gordan, & Kowalczyk, 1989), clutter (Tidhar et al., 1994), multiple targets (Rotman, Gordan,
& Kowalczyk, 1989), and selective visual attention2. As such, it is important to know the
limitations and theoretical extensibility of the models.


1NVESD is part of the U.S. Army Research, Development, and Engineering Command’s Communications and
Electronics  Research,  Development,  and  Engineering  Center. 

2The author of this report is involved in ongoing research into the role of selective visual attention in target
acquisition.  One  goal  of  the  research  is  to  determine  if  ACQUIRE’s  performance  can  be  improved  by  the  inclusion 
of  attention  parameters.  ACQUIRE  is  not  an  acronym. 
1.3.2 Perceptual Psychology and How It Can Inform Target Acquisition Modeling

The greatest theoretical advances in understanding visual search processes have occurred in the
reductionistic environments of academic perception laboratories. The resulting models and
theories  may  be  of  limited  direct  applicability  to  military  target  acquisition  scenarios.  However, 
they  constrain  models  and  inform  the  reader  about  known  visual  phenomena  relevant  to  target 
acquisition.  Current  models  from  the  perceptual  literature  are  discussed  in  terms  of  their 
generalizability  to  the  battlefield. 

1.3.3  How  Performance  is  Measured 

Different models use different measures as predictors of performance (e.g., response time,
observer sensitivity [d'], false detection percentage, probability of detection). Models may not
be  directly  comparable  in  that  the  dependent  measures  (a)  do  not  necessarily  map  onto  each  other 
in  a  well-defined  way,  and  (b)  may  not  exchange  predictably  as  observer  and  scene  parameters 
change.  To  the  extent  possible,  models  are  discussed  in  terms  of  how  these  various  dependent 
measures  may  be  differentially  affected  by  parameter  changes. 
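The difficulty of mapping one dependent measure onto another can be illustrated with a minimal sketch based on equal-variance Gaussian signal detection theory. This framework is an assumption introduced here for illustration and is not taken from any of the models reviewed; the function names (d_prime, rates_from) and all numerical values are hypothetical. Under this assumption, sensitivity is d' = z(H) - z(F), where H is the hit rate and F is the false-alarm rate, so two observers with identical d' can produce different probabilities of detection simply by adopting different decision criteria.

```python
from statistics import NormalDist
from typing import Tuple

# Standard normal distribution (mean 0, standard deviation 1).
_STD_NORMAL = NormalDist()


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity under the equal-variance Gaussian assumption: z(H) - z(F)."""
    return _STD_NORMAL.inv_cdf(hit_rate) - _STD_NORMAL.inv_cdf(false_alarm_rate)


def rates_from(sensitivity: float, criterion: float) -> Tuple[float, float]:
    """Return (hit rate, false-alarm rate) for a given sensitivity.

    `criterion` is the decision threshold in z-units above the mean of the
    noise distribution (an illustrative parameterization).
    """
    hit_rate = 1.0 - _STD_NORMAL.cdf(criterion - sensitivity)
    false_alarm_rate = 1.0 - _STD_NORMAL.cdf(criterion)
    return hit_rate, false_alarm_rate


# Two hypothetical observers with the same sensitivity but different criteria.
liberal = rates_from(sensitivity=1.5, criterion=0.25)
conservative = rates_from(sensitivity=1.5, criterion=1.25)

for label, (hit, fa) in (("liberal", liberal), ("conservative", conservative)):
    print(f"{label}: P(detect)={hit:.3f}, P(false alarm)={fa:.3f}, "
          f"d'={d_prime(hit, fa):.3f}")
```

In this sketch both observers have the same d', yet the liberal observer shows both a higher probability of detection and a higher false detection rate. A model that reports only one of these measures therefore cannot be compared directly with a model that reports another unless the observer's criterion is also known.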

1.3.4  Issues  Related  to  the  Validation  and  Testing  of  Models 

The author of this report made no attempt to instantiate the models in software or hardware in
order to evaluate them head to head. There is a brief discussion of issues related to the validation
of models and the need for a robust data set to perform laboratory studies of models before field
trials.

The scope of the review includes non-classified literature from the defense and the academic
communities that relates to the acquisition of ground targets. Although target acquisition models
date  back  several  decades  (see  Greening,  1974,  for  a  review  of  early  efforts),  this  review  focuses 
on  identifying  the  state  of  the  art  in  modeling  and  discusses  only  classic  models  that  have  broken 
new  ground  and  are  still  of  theoretical  interest.  Models  from  the  perceptual  psychology  literature 
are  also  discussed  for  their  role  in  promulgating  new  theoretical  ideas  that  may  or  may  not  be 
generalizable  to  real-world  target  acquisition. 


2.  Introduction 


Before target acquisition models can be discussed, it is important to define terms that appear
throughout this report. Bliss pointed out in 1974 that no clear standards existed for what is
specifically  meant  by  the  term  “target  acquisition.”  Since  then,  models  and  theories  of  how 
targets  can  be  acquired  from  various  disciplines  (e.g.,  machine  vision,  perceptual  psychology, 
military  simulation,  electro-optical  design)  have  proliferated.  However,  there  remains  an 
absence  of  standards  for  basic  terms. 
In 1990, the Quadripartite Working Group on Army Operational Research proposed standard
definitions  that  are  used  in  this  report  when  we  discuss  target  acquisition  models.  Some 
definitions  from  that  working  group  are 

•  Target  Acquisition 

All  those  processes  required  to  locate  a  target  image  whose  position  may  be  uncertain  and  to 
discriminate  it  to  the  desired  level  (detection,  classification,  recognition,  identification).  The 
target  acquisition  process  includes  the  search  process  at  the  end  of  which  the  target  is  located  and 
the  discrimination  process  at  the  end  of  which  the  target  is  acquired.  This  definition  assumes 
that  a  time-dependent  search  process  is  involved.  However,  target  acquisition  may  involve  the 
discrimination  of  a  target  whose  position  is  known  ahead  of  time.  Such  a  static  process  is 
assumed  to  be  the  same  as  the  discrimination  stage  of  the  above-defined  target  acquisition 
process. 

•  Search 

The  process  of  visually  sampling  the  search  field  in  an  effort  to  locate  or  acquire  targets. 

•  Discrimination 

A  process  in  which  an  object  is  assigned  to  a  subset  of  a  larger  set  of  objects,  based  on  the 
amount  of  detail  perceived  by  the  observer,  and  the  application  of  knowledge  of  those  details 
sufficient  to  afford  such  an  action. 

•  Detection 

The  perception  of  an  object  image  (which  may  be  a  target  image)  as  being  present  at  a  particular 
location  and  distinct  from  its  surroundings. 

•  Classification 

The  determination  of  whether  a  detected  object  is  a  member  of  a  particular  set  of  possible  targets 
or  non-targets  (e.g.,  wheeled  versus  tracked  vehicles). 

•  Recognition 

The  determination  that  a  target  belongs  to  a  particular  functional  category  (e.g.,  a  tank,  a  truck, 
an  armored  personnel  carrier,  etc.). 

•  Identification 

The most detailed level of discrimination of particular relevance for military target acquisition,
as discussed shortly (e.g., a T-72, T-62, M1, or M60 tank).

Inherent in these definitions of the processes involved in target acquisition are the ideas that first,
information must be extracted from the scene and second, that the Soldier in the loop must be
able to use such information to make an appropriate decision. (In some cases, the decision made
must be that the information in the scene is insufficient even for detection. In such cases, the
decision made by the Soldier is the declaration that no target is present.) In addition to
information-related constraints, the Soldier must have both the perceptual capability to perceive
and the cognitive ability to understand the information in order to employ it. Although this fact
may seem obvious, modeling the observer’s decision-making process is no simple feat.

The  goal  of  this  technical  report  is  to  provide  an  overview  of  the  literature  relevant  to  the 
modeling  of  the  human  in  the  loop  in  target  acquisition.  Figure  1  highlights  the  flow  of 
information in the target acquisition process from the visual information in the scene through any
optical  or  electro-optical  sensor  systems  to  the  human  visual  system  and  finally,  to  the  observer’s 
decision-making  processes.  This  report  highlights  the  difficulties  associated  with  target 
acquisition,  which  arise  from  each  of  these  levels,  with  particular  emphasis  on  the  last  three 
elements  in  which  the  human  observer  is  given  a  scene,  either  optically  or  electro-optically,  from 
which  he3  attempts  to  extract  information  and  acquire  a  target. 


Figure  1.  The  flow  of  information  in  human-in-the-loop  target  acquisition. 

Before  we  detail  the  complexity  associated  with  the  elements  of  the  human-in-the-loop  target 
acquisition  process  and  how  they  influence  modeling  the  target  acquisition  process,  it  is  useful  to 
briefly  say  why  the  human  is  in  the  target  acquisition  loop  to  begin  with.  Although  research  into 
automatic  target  recognition  (ATR)  and  aided  target  recognition  proceeds  at  a  rapid  pace,  current 
ATR  systems  lack  sufficient  accuracy  and  flexibility  to  allow  them  to  take  over  the  process  of 
target acquisition from humans (e.g., Dudgeon, 1998). The deficiencies of ATR systems become
particularly apparent (a) when they are called upon to perform acquisition tasks in which the space
of possible targets is large, and (b) when non-visual factors such as situational context,
experience, and judgment must be taken into account before an action is taken regarding a
potential target. Therefore, the human observer must be available to make the final decision
regarding action (or inaction) in the target acquisition situation.

Given that the human remains firmly in the loop for the foreseeable future, as the decision maker
and as the actual acquirer of potential targets, it is imperative to understand the factors known to
have an impact on Soldier performance in real-world target acquisition. Table 1
lists several such factors, broken into their effects on the visual display of the scene in which
target acquisition is to be performed, and their effects on the decision-making process of the
observer (from Howe, 1993).


3The male gender pronoun “he” is used throughout this technical report in order to facilitate readability.
Table 1. Factors known to affect performance of human in the loop.

Locus of Effect of Factors     Factors
---------------------------    --------------------------------------------------------------
Visual display of scene        Target type, size, shape, contrast with immediate background,
                               motion, shadow, masking by background elements, camouflage,
                               scene clutter, transient cues. Environmental visibility, cloud
                               cover, sun angle, diurnal and seasonal variation, atmospheric
                               scattering, illumination level, field of view.

Decision making of observer    Training, motivation, experience, expectations for possible
                               targets, stress, concurrent task load, visual acuity, search
                               pattern, fatigue, field of regard, attentional set.

In addition to factors in table 1, there are factors that depend on the sensor system being used.
For instance, although table 1 may suffice to encompass factors relevant to an observer viewing a
scene with the unaided eye, additional factors such as display resolution, phosphor decay rates,
sensor temporal and spatial resolution, atmospheric turbulence and scattering, and target
emittance and temperature must be added in order to account for performance variability when
one is viewing FLIR (forward-looking infrared) imagery. Various models may take such
factors into account (or fail to do so at their peril) when they attempt to predict Soldier
performance with various electro-optical devices.

Because  no  single  model  can  possibly  include  all  factors  known  to  influence  target  acquisition 
performance,  models  will  account  for  some  of  the  factors  and  ignore  others  for  theoretical 
reasons.  (Such  an  approach,  this  reviewer  would  argue,  is  the  only  likely  way  these  factors  will 
ever  be  understood  with  the  depth  necessary  to  model  them.) 

The  observer  factors  listed  in  table  1  may  become  especially  acute,  given  the  increasing  demands 
placed on the individual Soldier by technology. Soldiers are called upon to use ever more
sophisticated sensor systems and will therefore be forced to deal effectively with an
ever-increasing amount of information about the scene. In addition to the increasing cognitive and
sensory  demands  placed  on  the  Soldier  by  technology,  potential  enemies  also  use  improvements 
in  camouflage,  concealment,  and  deception  (CCD)  technology  to  better  hide  themselves. 
Therefore,  it  seems  obvious  that  any  understanding  of  the  human  in  the  loop  must  account  for 
observer  variables  and  how  they  interact  with  factors  influencing  the  display  of  visual 
information  to  the  observer. 

2.1 The Goals of Target Acquisition Modeling

There  are  several  reasons  why  it  is  desirable  to  predict  target  acquisition  performance.  These 
reasons  include 

1. Better Soldier training

Training  is  costly  and  time  consuming.  Learning  why  Soldiers  perform  as  they  do  and 
understanding  the  influences  that  experience,  knowledge,  and  expectations  have  on  acquisition 
performance  may  allow  for  better  and  more  efficient  training  of  Soldiers.  For  example,  if  a 
particular  kind  of  terrain  is  known  to  cause  problems  in  tank  identification,  then  training  may 
focus  on  providing  more  experience  with  the  particular  target-terrain  interaction. 
2. Reduced fratricide


Current  weapon  systems  are  accurate  and  lethal  at  ranges  that  often  far  exceed  the  identification 
range  of  the  Soldier  controlling  the  weapon.  Misidentification  may  therefore  lead  to  missing  an 
enemy  or  firing  on  a  comrade.  Understanding  when  and  why  such  misidentifications  occur  may 
inform  the  development  of  better  sensor  systems  or  training  in  order  to  reduce  those  errors. 

3.  Improved  sensor  systems 

Sensor  systems  that  provide  the  image  to  the  Soldier  in  the  loop  cannot  be  evaluated  properly 
unless  we  know  what  aspects  of  the  sensor  display  (i.e.,  the  rendered  scene)  have  an  impact  on 
Soldier  performance.  Also,  a  functional  model  of  the  human  in  the  loop  will  allow  for  sensors  to 
be  evaluated  before  production,  thus  reducing  costs  while  increasing  Soldier  effectiveness. 

4.  More  effective  CCD  techniques 

Conversely, knowing the circumstances in which particular targets will be difficult to acquire
will allow the Army to take advantage of those situations in order to make detection of our own
forces more difficult.

2.2  Approach  of  the  Author  and  Format  of  Review 

Models  of  theoretical  or  historical  importance  are  included  to  paint  a  relatively  complete  picture 
of  the  current  state  of  target  acquisition  modeling  with  respect  to  the  domain  specified  in  the 
TPA.  Major  models  are  classified  along  a  set  of  five  dimensions  (described  next)  and  discussed. 
Theoretical details of the models are discussed in terms of the aspects of the scenes and observer
variables accounted for, dependent measures predicted, and possible theoretical and empirical
shortcomings. As mentioned previously, the author did not attempt to instantiate any of the
models for a direct comparison. Rather, the literature reviewed in this report is described and
critiqued in terms of agreement with empirical findings4 and with theoretical understanding of
human visual processing.


3.  Model  Description  Scheme 


A  five-dimensional  descriptive  framework  is  outlined.  The  inherent  strengths  and  weaknesses  of 
models  at  various  points  of  the  dimensions  are  discussed.  All  models  reviewed  in  detail  are 
given  scores  along  the  dimensions. 


4Since  no  experiments  were  done  by  the  author,  the  empirical  tests  of  most  models  come  from  the  respective 
authors  themselves  or  from  third  parties  who  instantiated  and  tested  the  models  directly. 
In order to make sense of the wide array of literature about target acquisition performance
models, it is useful to rate each model, based on dimensions describing aspects of its function
and the domain over which it may be used. Five dimensions were selected5, based on Greening
(1974):

1. optical/objective ... cognitive/subjective

This dimension refers to the locus of the observer’s information processing. That is, does the
observer make his decision on the basis of the visual information in the scene or on his
subjective interpretation of what he perceives the visual percept to be? Modeling the former is
straightforward in that all the information used to make the decision is readily available to the
modeler. Modeling the latter is more problematic because inferences must be made about the
cognitive processing that the observer performs to reach a decision.

2. reductive ... comprehensive

This  dimension  expresses  the  possible  extremes  of  approach  in  terms  of  how  much  of  the  target 
acquisition  process  is  to  be  accounted  for  by  the  model.  (This  dimension  correlates  highly  with 
the  generalizability  of  the  model.)  Reductive  models  are  easy  to  support  or  disprove  since  they 
make testable predictions. However, such models lack sufficient detail to extend their
predictions to real-world situations. Comprehensive models take many factors into account but may
suffer  from  a  combinatorial  explosion  of  possible  interactions  and  may  be  difficult  to  verify;  tests 
of  such  models  may  lack  sufficient  statistical  power  to  tease  apart  the  effects  of  one  or  another 
factor. 

3. target-centered ... situation-centered

This  dimension  expresses  the  range  of  information  given  in  the  scene  that  the  subject  can  use  to 
aid  in  acquiring  the  target.  For  example,  a  purely  target-centered  scene  may  contain  a  tank 
parked  on  a  uniform  texture  field  (i.e.,  no  information  in  the  scene  guides  searches  for  the  target 
except  the  target  itself).  At  the  other  extreme  is  a  scene  containing  mountainous  terrain  and  a 
number  of  roads  upon  which  a  target  must  travel.  In  this  case,  the  roads  guide  the  search  for  the 
target  to  such  an  extent  that  the  target  may  become  immediately  apparent.  Purely  target-centered 
models  exist  primarily  in  studies  of  perceptual  psychology  and  psychophysics  or  as  a  means  of 
testing  specific  predictions  about  factors  affecting  performance.  Situation-centered  models,  on 
the  other  hand,  are  more  realistic  but  must  make  more  assumptions  about  the  cognitive  processes 
underlying  acquisition  performance. 

4. physiological ... empirical

This dimension refers to the degree to which the model is based on human visual physiology or
on curve fits to previously collected empirical data. Between the two extremes lie models that
base their performance predictions on known human psychophysics. Such psychophysical
models may rely either on psychometric functions (i.e., be more empirical) or on the physiology
that underlies the psychometric functions (i.e., be more physiological). Models that are more
physiological have the potential of being applicable to a greater variety of situations, although
the models typically have more parameters to “tweak” to make them work, and the values of
those parameters may not have strong theoretical underpinnings.

5Note that no attempt was made to demonstrate the orthogonality of these dimensions.

5. individual ... ensemble

This dimension refers to whether the model attempts to (or is able to) predict performance for an
individual observer or an ensemble of observers. Although this dimension may at first glance
appear to be a simple dichotomy, the breakdown is not so clear. For example, it would be a
simple matter for an individual performance-based model to predict ensemble performance by
processing groups of individuals, but it may or may not be possible for an ensemble-based model
to step down to performance prediction at the level of the individual. The implications of this
asymmetry come into play in terms of the inclusion of observer variables in that ensemble
models typically assume the presence of “trained military observers” (e.g., O’Kane, 1995) and
allow little theoretical room for the addition of individual factors.
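The asymmetry between individual and ensemble prediction can be made concrete with a minimal sketch. It assumes, purely for illustration, that an individual-based model outputs a probability of detection for each observer and that ensemble performance is the arithmetic mean of those probabilities; neither assumption is drawn from a specific model in this review, and the function name and values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def ensemble_probability(individual_probabilities: Sequence[float]) -> float:
    """Ensemble probability of detection as the mean of per-observer values."""
    return mean(individual_probabilities)


# Hypothetical per-observer predictions for the same scene.
# Group A mixes strong and weak observers; group B is homogeneous.
group_a = [0.75, 0.75, 0.25, 0.25]
group_b = [0.5, 0.5, 0.5, 0.5]

# Stepping up from individuals to the ensemble is a simple aggregation.
print(ensemble_probability(group_a))
print(ensemble_probability(group_b))

# Stepping down is not determined: both groups share one ensemble value,
# so that value alone cannot say which set of individuals produced it.
print(ensemble_probability(group_a) == ensemble_probability(group_b))
```

Because very different sets of observers can yield the same ensemble value, an ensemble-based model needs additional assumptions about how observers differ before it can say anything about a particular individual.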


4.  Basic  Types  of  Models 


Although  literally  hundreds  of  models  have  emerged  over  the  years,  the  bulk  of  the  models 
reviewed  in  this  report  fall  into  a  few  basic  classes.  These  classes  are  discussed. 

This  review  of  the  literature  divides  the  space  of  existing  models  into  four  broad  types,  as 
determined  by  the  underlying  processes  that  the  model  assumes  drive  performance.  The  classes 
are 

1. Models based on physiology and empirical human psychophysics,

2.  Models  based  on  non-physiological  feature  extraction, 

3.  Models  based  on  theoretical  constructs  and  scene  descriptions/metrics,  and 

4.  Models  based  on  largely  atheoretical  fits  to  empirical  data. 

We  mention  where  each  type  of  model  lies  along  the  five  dimensions  listed.  Examples  of  such 
models  are  given,  and  the  strengths  and  limitations  of  such  models  are  discussed.  It  will  be  clear 
that  there  are  models  that  do  not  fit  neatly  into  one  type  but  contain  characteristics  of  several 
types.  In  such  cases,  the  classification  is  based  on  the  information  purported  to  be  used  by  the 
observer  to  make  a  decision. 
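One way to record where a model or class of models lies along the five dimensions is sketched below. The numeric scale (0.0 for the first-named pole of each dimension, 1.0 for the second-named pole), the class and field names, and the example values are all assumptions made for illustration; they are not the scores assigned in this report. The distance function is only a crude summary, since no attempt was made to show that the dimensions are orthogonal.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ModelProfile:
    """Position of a model on each of the five descriptive dimensions.

    Assumed scale: 0.0 is the first-named pole, 1.0 the second-named pole.
    """
    optical_vs_cognitive: float
    reductive_vs_comprehensive: float
    target_vs_situation_centered: float
    physiological_vs_empirical: float
    individual_vs_ensemble: float

    def __post_init__(self) -> None:
        # Reject scores that fall outside the assumed 0-to-1 scale.
        for field in fields(self):
            value = getattr(self, field.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{field.name} must lie in [0, 1], got {value}")


def profile_distance(a: ModelProfile, b: ModelProfile) -> float:
    """Mean absolute difference between two profiles across all dimensions."""
    names = [field.name for field in fields(a)]
    return sum(abs(getattr(a, n) - getattr(b, n)) for n in names) / len(names)


# Placeholder profiles with hypothetical values.
example_a = ModelProfile(0.25, 0.5, 0.25, 0.0, 0.25)
example_b = ModelProfile(0.25, 0.5, 0.75, 1.0, 0.75)

print(profile_distance(example_a, example_b))
```

Recording the placements explicitly makes it easy to see on which dimensions two model classes agree and on which they diverge, which is the comparison the descriptive scheme is intended to support.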
1. Models based on physiology and empirical human psychophysics:


[Figure: placement of this class of models along the five dimensions: optical/objective ... cognitive/subjective; reductive ... comprehensive; target-centered ... situation-centered; physiological ... empirical; individual ... ensemble.]

These models base their performance predictions on how the human visual system is known to
respond to simple stimuli. That is, the models take what is known about vision from
physiological studies of the visual system (e.g., Hubel & Wiesel, 1962, 1968; Campbell &
Robson, 1968) and psychophysical studies of how physical stimuli determine overt perception
and performance (e.g., Nachmias, 1981) and apply this knowledge to the acquisition of targets in
the real world.

This category is the broadest in this report, largely because of the theoretical distance between
physiology on one hand and psychophysics on the other. The reason why they have been
grouped together is that both attempt to extend knowledge of how the visual system responds to
simple stimuli (as determined by studies of visual physiology or psychophysics) to militarily
relevant stimuli. Also, physiological models are constrained in that they must conform to known
psychophysics, so although two models within this category may process the visual information
within a scene very differently (one by analyzing it with physiologically based filters and
transforms; the other by appealing to psychometric functions), their result may be identical.

There  are  numerous  examples  of  this  type  of  model  (e.g.,  British  Aerospace  ORACLE6  model, 
Georgia  Tech  Vision  [GTV],  Wilson’s  Spatial  Vision  model,  and  the  cortex  transform-based 
distortion  metric).  These  models  tend  to  be  some  of  the  most  complex  of  all  target  acquisition 
models  because  their  bases  in  physiology  and  psychophysics  allow  them  to  incorporate  many 
factors  known  to  influence  human  perception  so  long  as  the  effects  of  the  factors  are  adequately 
understood. 

There are also models in this class that base perception on the interpretation of the output of
physiological mechanisms. Models of this kind treat the pieces of interpreted information as
“features” or components of objects and background elements in the scene. Typically, these
models are geared toward a basic understanding of the visual system and do not constitute
full-scale models of target acquisition. Examples of these models include MIRAGE7 (Watt &
Morgan, 1985), MIDAAS8 (Kingdom & Moulden, 1992), and various vision models by
Grossberg and colleagues (e.g., Grossberg, 1997; Grossberg, Mingolla, & Ross, 1994). These
feature-based models are quite distinct from the second category of models.

6ORACLE is not an acronym.

7The acronym MIRAGE is nothing short of a description in and of itself: “Multiple Independent filters of various
sizes and with both signs, half-wave Rectified before Averaging. The resultant signals are Gated between adjacent
zeroes for the Extraction of the primitive code.”

8MIDAAS stands for Multiple Independent Descriptions Averaged Across Scale.

2.  Models  based  on  non-physiological  feature  extraction: 


[Figure: placement of this class of models along the five dimensions: optical/objective ... cognitive/subjective; reductive ... comprehensive; target-centered ... situation-centered; physiological ... empirical; individual ... ensemble.]

These models base their predictions on the extraction of specific features from a scene rather
than on an observer’s ability to extract simple visual information. As was the case for
physiology-based feature-extraction models in the previous category, the extracted features are
assumed more likely to be properties of the visual signatures of military targets than of
non-target elements in the scene. However, unlike the previous class, the selection of the features
themselves in these models is not based on how the human visual system is known to function.
Instead of appealing to simple physical stimuli such as oriented line segments (the output of
early cortical visual processing [see Hubel, 1988, for an excellent review of this early work]) as
the features of interest, these models assume that visual processing depends on more complex
representations not having a direct correspondence to early visual processing.

Examples of such models include the edge-based 2½-dimensional representation (Marr, 1982;
Marr  &  Hildreth,  1980),  recognition  by  components  (RBC)  theory  (Biederman,  1987),  object 
symmetry  (Rosenfeld,  Wolfson,  &  Yeshurun,  1995),  Guided  Search  models  (Wolfe,  1994b; 
Wolfe  &  Gancarz,  1996),  search  by  recursive  rejection  (SERR)  (Humphreys  &  Muller,  1993), 
texture-based  search  (Nothdurft,  1991),  and  Feature  Integration  Theory  (Treisman  &  Gelade, 
1980;  Treisman  &  Sato,  1990). 

It  is  interesting  to  note  that  this  class  of  models  contains  the  greatest  preponderance  of  thinking 
from  perceptual  psychology.  The  reason  is  that  perceptual  psychology  has  traditionally 
attempted  to  speak  of  the  visual  world  in  terms  of  objects  (e.g.,  Duncan,  1984),  groups  (Vecera 
&  Farah,  1994),  surfaces  (Nakayama  &  He,  1994),  and  features  based  loosely  on  visual 
physiology  such  as  T-  and  L-junctions  (Biederman,  1987),  color  (Theeuwes,  1995),  etc.  Much 
progress  has  been  made  in  understanding  human  visual  search  by  the  use  of  this  reductionistic 
technique,  and  some  of  the  most  theoretically  sophisticated  information-processing  models  of 
vision  are  based  on  such  a  breakdown  of  the  scene. 

The  strength  of  the  non-physiological  feature  approach  is  that  the  models  have  good  agreement 
with  human  performance  in  the  laboratory  setting.  The  models  can  also  more  readily  use  the 
information  required  for  discrimination  judgments  because  they  ostensibly  concern  the  features 
that the visual system employs to form such judgments and because the models arise from the
perceptual psychology community where models of judgment and decision making are well
developed. 

The  primary  limitation  of  these  models  is  obvious.  Because  they  were  developed  in  the 
laboratory  where  stimuli  are  reduced  to  their  presumably  most  basic  forms,  there  is  little 
evidence  that  most  models  can  be  applied  at  all  to  visual  processing  of  real-world  stimuli.  The 
primary  reason  for  the  lack  of  generalizability  is  that  the  real  world  cannot  simply  be  reduced  to 
a  set  of  basic  stimuli.  (If  it  can,  nobody  has  yet  figured  out  what  they  are!)  Some  attempts  to  try 
to  bridge  the  gap  between  the  lab  and  the  field  have  been  made  with  limited  success  (e.g.,  Wolfe, 
1994a). 

3.  Models  based  on  theoretical  constructs  and  scene  descriptions: 


[Figure: rating of this model class on five bipolar scales: optical/objective vs. cognitive/subjective,
reductive vs. comprehensive, target-centered vs. situation-centered, physiological vs. empirical,
individual vs. ensemble.]

These  models  also  base  their  predictions  of  performance  on  the  presence  within  the  scene  of 
information of a particular type. In these models, however, the information does not take the
form of specific features or combinations of features but rather a less theoretical form.

Generally  speaking,  the  more  such  information  is  present  at  the  target  location,  the  greater  the 
probability or possible level of acquisition. The constructs used by the models are typically
one-dimensional metrics such as conspicuity (e.g., Toet, 1996), number of resolvable cycles, N, of a
bar pattern (i.e., a square wave) on a target (Johnson, 1958), or complexity (e.g., Tidhar et al.,
1994). Such metrics may apply to the location of the target only or they may apply to the entire
scene.  For  example,  unidimensional  clutter  metrics  can  be  global  (relating  to  the  entire  scene)  or 
local  (relating  only  to  a  small  region). 

The  logic  underpinning  these  theories  is  that  more  information  about  a  target  should  allow  a 
greater  proportion  of  observers  to  be  able  to  acquire  it.  Most  of  the  models  and  metrics  based  on 
these constructs are used for predicting ensemble performance. Examples of models in this
category  include  the  Johnson-criteria-based  models  from  NVESD,  FLIR92  (Scott  & 
D’Angostino,  1992)  and  ACQUIRE  (Tomkinson,  1990),  the  Bailey/Rand  search  model  (Bailey, 
1970),  metrics  of  clutter  and  its  inverse,  conspicuity  (e.g.,  Toet,  1996),  and  models  of  target 
distinctiveness  (Ahumada  &  Beard,  1996). 

The  strength  of  these  models  comes  from  their  simplicity  and  robustness.  The  metrics  often  used 
(e.g.,  resolvable  detail,  clutter)  have  stood  the  test  of  time  and  are  widely  used  as  predictors  of 
performance.  Clutter,  for  example,  is  known  to  influence  performance  very  strongly  and  in 
many  ways  (Akerman,  1993a  &  b).  In  addition,  new  models  of  this  sort  are  still  being  created 
and have predictive validity (e.g., Overington’s 1982 disk discrimination metric; Bijl &
Valeton’s 1998a triangle orientation discrimination metric). These new metrics are discussed in
greater  detail  shortly. 

The primary weakness is that the hypothesized constructs are derived solely from the scene and
that the models are designed around ensemble performance rather than individual performance.
As such, there may be limited opportunity to add observer variables also known to influence
performance.

4.  Models  based  on  largely  atheoretical  fits  to  empirical  data: 


[Figure: rating of this model class on five bipolar scales: optical/objective vs. cognitive/subjective,
reductive vs. comprehensive, target-centered vs. situation-centered, physiological vs. empirical,
individual vs. ensemble.]

There is a relatively uncommon class of models that predicts performance almost entirely by
fitting  empirical  performance  data  from  previous  studies  to  a  set  of  parameters  measured  or 
controlled  in  those  studies.  Models  in  this  category  tend  to  be  older  (e.g.,  Bishop  &  Stollmack, 
1968,  and  Poe’s  model  [see  Bailey,  1970,  for  a  discussion  of  Poe  in  relation  to  other  models]). 

Empirical  models  have  few  strengths.  Their  fundamental  shortcoming  is  the  lack  of  theory 
underlying  the  selection  of  parameters  and  the  functions  that  the  parameters  are  to  fit.  As  such, 
although  a  curve  fit  through  a  set  of  data  points  for  one  study  may  be  quite  good,  the  curve  will 
not  be  generalizable  to  experiments  with  different  parameters.  Even  worse,  the  model  might  not 
be  able  to  fit  data  with  the  same  parameters  because  the  way  that  the  parameters  mapped  onto 
performance  in  one  study  may  not  take  into  account  any  third  variables  that  actually  drive 
performance  or  modulate  the  effects  of  parameters.  Thus,  even  though  the  situation  would  seem 
to  be  identical  to  the  first  study,  in  reality,  it  may  be  quite  different. 


5.  Classic  Modeling  Concepts 


Most  models  make  some  common  underlying  assumptions  or  are  based  on  a  few  fundamental 
phenomena.  This  section  of  the  review  discusses  those  assumptions  as  they  have  been 
incorporated  into  many  models.  Of  particular  interest  in  this  section  are  the  Johnson  criteria,  the 
ACQUIRE  model,  and  its  incorporation  into  a  recent  NVESD  search  model  (FLIR92). 

Across  many  current  models,  there  are  a  few  common  underlying  concepts.  The  instantiation  of 
the  concepts  in  the  models,  however,  differs  from  model  to  model.  Here,  the  concepts  and  basic 
instantiations  of  the  concepts  are  discussed.  The  following  five  concepts  have  been  identified  as 
being basic to many models.

5.1 The Role of Contrast and Contrast Threshold

Central to all these ideas is that information used by the observer must be observable. That is,
the information related to the target must have sufficient contrast, either between the target and
the background or within the target, to allow the visual system to use it. The contrast threshold,
Ct, is defined as the intensity of a stimulus required for it to be barely detectable with some
reliability (usually 50% or 75%). It is typically described in terms of Ricco’s Law, a lawful
relation between the area of the target (or some other size-related quantity) and its intensity that
holds at or near threshold.
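As a hedged illustration, Ricco’s Law is commonly written as a reciprocity between threshold
intensity and target area. The symbols A (target area), A_c (a critical area beyond which the
relation is assumed to break down), and k (a constant) are introduced here for illustration only
and do not appear in the original text:

```latex
C_t \cdot A = k \quad \text{for } A \le A_c
\qquad\Longleftrightarrow\qquad
C_t = \frac{k}{A}
```

Under this relation, halving the area of a small target doubles the contrast required to detect it.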

Contrast  and  the  Johnson  criteria  (see  next  section  and  appendix  A)  are  intimately  related. 
Johnson  (1958)  found  that  detection  is  typically  afforded  when  a  single  cycle  or  less  (a  cycle 
being  defined  as  a  light  and  dark  bar  of  a  repeating  bar  pattern)  on  a  target  is  visible.  That  the 
requirement  for  detection  is  near  unity  (see  table  2)  is  consistent  with  the  idea  that  the  driving 
factor behind detection may be modeled by signal-to-noise ratio (SNR) or contrast. For
near-threshold targets (e.g., targets with a small ΔT relative to their background support viewed
through a FLIR sensor), the SNR is calculated in terms of a threshold SNR, below which a target
is not visible (Howe, 1993; Johnson, 1958). For super-threshold targets (when SNR >> 1), the
target contrast with respect to its immediate background is the crucial quantity.


Table  2.  Resolvable  cycles  across  critical  dimension  to  perform  50%  accurate  acquisition  (N50)  at 
particular  levels  of  target  acquisition 


Level of target acquisition      N50 (resolvable cycles)
Detection                        1.0 ± 0.25
Orientation (classification)     1.4 ± 0.35
Recognition                      4.0 ± 0.8
Identification                   6.4 ± 1.5

Johnson  also  found  that  greater  levels  of  target  acquisition  could  be  afforded  when  a  greater 
number  of  cycles  within  a  target  are  detectable.  Once  again,  the  concept  of  contrast  comes  into 
play  in  that  these  internal  details  must  have  sufficient  contrast  with  their  background  to  be 
detectable. 
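The relation between resolvable cycles and attainable acquisition level in table 2 can be sketched
as a simple lookup. This is a minimal illustration, not part of any fielded model: the names
N50_BY_LEVEL and levels_supported are hypothetical, the tolerance bands (e.g., ± 0.25) are ignored,
and "supported" is assumed to mean only that the 50% ensemble criterion is met.

```python
# N50 values (resolvable cycles across the critical dimension) taken from table 2.
# The +/- tolerances reported in the table are deliberately ignored in this sketch.
N50_BY_LEVEL = {
    "detection": 1.0,
    "orientation": 1.4,
    "recognition": 4.0,
    "identification": 6.4,
}


def levels_supported(n_cycles):
    """Return the acquisition levels whose N50 criterion is met by n_cycles."""
    return [level for level, n50 in N50_BY_LEVEL.items() if n_cycles >= n50]


if __name__ == "__main__":
    for n in (0.5, 1.2, 4.5, 7.0):
        print(n, levels_supported(n))
```

With 4.5 resolvable cycles, for example, the sketch reports detection, orientation, and recognition
but not identification.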

5.2  Johnson  (1958)  or  Johnson-like  Target  Information  Requirements  for  Levels  of 
Target  Acquisition  Performance 

Johnson  (1958)  found  that  ensemble  target  acquisition  performance  can  be  predicted  by  a 
determination  of  the  number  of  resolvable  bar  cycles  that  can  be  perceived  on  a  target  (a  quantity 
called  N).  (See  appendix  A  for  a  detailed  description  of  the  method  Johnson  used.)  Johnson 
found,  not  surprisingly,  that  the  ability  to  perform  increasing  levels  of  target  acquisition  (i.e., 
detection → classification → recognition → identification) required that a greater number of bars
be  resolvable.  The  resulting  “Johnson  criteria,”  the  amount  of  internal  detail  required  to  acquire 
a  target,  are  widely  cited  and  used  in  models  of  ensemble  performance  (e.g.,  ACQUIRE  and 
FLIR92). Table 2 shows the findings from Johnson’s original study.

The shape of the function describing the relationship between N and the probability of detection
is, as one would expect, not a step function at or near 1.0 cycle. Rather, N50 describes the
corresponding number of cycles for 50% ensemble performance on an ogive-shaped function
called the target transform probability function (TTPF). The TTPF gives the predicted probability
of detection (Pd) as a function of the ratio N/N50. For detection, the TTPF can be described as
follows:

P_d = \frac{(N/N_{50})^{E}}{1 + (N/N_{50})^{E}}

in  which  N  =  number  of  cycles  resolvable  on  the  target, 

N50  =  number  of  cycles  required  for  50%  of  observers  to  detect  the  target,  and 
E  =  2.7  +  0.7(N/N50) 

Note that the TTPF describes performance at the ensemble level and is not intended to predict
within-subject performance across trials9.
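The TTPF defined above can be evaluated directly. The following is a minimal sketch under the
definitions given in this section; the function name ttpf and its argument names are illustrative
assumptions, and the split into two algebraically equivalent branches is only a numerical
convenience to avoid floating-point overflow.

```python
def ttpf(n, n50):
    """Ensemble probability from the TTPF for n resolvable cycles.

    n   -- number of cycles resolvable on the target (N)
    n50 -- number of cycles required for 50% ensemble performance (N50)
    """
    if n < 0 or n50 <= 0:
        raise ValueError("n must be non-negative and n50 must be positive")
    if n == 0:
        return 0.0
    ratio = n / n50
    exponent = 2.7 + 0.7 * ratio  # E = 2.7 + 0.7 (N / N50)
    if ratio <= 1.0:
        x = ratio ** exponent
        return x / (1.0 + x)
    # Same function rewritten as 1 / (1 + ratio**-E) so large ratios cannot overflow.
    return 1.0 / (1.0 + ratio ** (-exponent))


if __name__ == "__main__":
    for n in (0.5, 1.0, 2.0, 4.0):
        print(n, round(ttpf(n, 1.0), 4))
```

By construction the function returns 0.5 when N equals N50 and rises smoothly toward 1.0 as N
increases, which is the ogive shape described in the text.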

Johnson’s  original  idea  has  undergone  few  substantial  changes  since  its  first  publication, 
although  current  so-called  two-dimensional  (2-D)  extensions  of  the  criteria  take  into  account  the 
height  and  width  of  the  target  rather  than  simply  a  “critical”  dimension  (e.g.,  the  ACQUIRE 
model  is  based  on  such  an  approach). 

That  “information”  resolvable  about  a  target  should  drive  performance  as  a  unidimensional 
quantity  is  a  powerful  idea.  Recent  models  have  gone  about  determining  the  target-like 
information  in  the  scene  differently,  but  there  remains  a  central  requirement  that  a  given  amount 
of  target  information  is  needed  for  the  average  observer  to  acquire  the  target.  (See  the  following 
section  for  a  detailed  description  of  these  efforts.) 

5.3  The  “Classical  Approach”  to  Modeling  Search  and  Bailey’s  (1970)  Separability  of 
Time-Dependent  and  Time-Independent  Search  Processes 

The  so-called  “classical  approach”  to  search  modeling  was  first  put  forth  by  Bailey  (1970),  in 
which  probability  of  acquisition  in  search  is  a  product  of  independently  considered  time- 
dependent  and  time-independent  stages. 

Bailey asserted that PR, the probability of acquiring (recognizing or identifying) a target, is the
product of P1, the probability that a single glimpse will locate the target region of a scene, P2, the
probability that if the target is viewed foveally, it will be detected, and P3, the probability that if
the target is detected, it will be recognized or identified10:

P_R = P_1 \times P_2 \times P_3


9Such an analysis has been done, however, in order to evaluate the kinds of errors that such an ensemble
predictor makes. For example, Valeton and Bijl (1995) looked at individual deviations from ensemble predictions in
the evaluation of the Target Acquisition (TARGAC) model (which bases its predictions on a Johnson-type model),
and Silk (1997) used such deviations to evaluate whether P , is a biased estimator of ensemble performance.

10Indeed, this description of target acquisition is the same as was provided in the definition in section 2.


The first term, P1, is time dependent in that it is assumed that during the search of a scene, a
glimpse has a dwell time at a certain location and a certain amount of time between fixations for
eye movements. Search progresses by the random selection of locations about the scene. The
cumulative probability, P1(t), that a saccade will land sufficiently close to a target within time t
is described as the first arrival time of a Poisson process:

P_1(t) = 1 - e^{-t/\tau_{FOV}}

in which τ_FOV = the mean acquisition time, given that a target is fixated.

The  second  two  terms  are  independent  of  time  in  that  they  are  both  conditional  on  the  target 
having  been  fixated.  Bailey  (1970)  derived  separate  terms  for  P2  and  P3,  which  are  of  historical 
significance  only  (although  Ryll  [1962]  incorporated  the  effect  of  scene  clutter  into  the  P3  term). 
The most popular search models in use today incorporate a limiting term, P∞, to denote that even
after  an  infinite  amount  of  time,  some  members  of  an  ensemble  of  observers  will  be  unable  to 
acquire  the  target. 

The  current,  widely  accepted  NVESD  models  ACQUIRE  (Tomkinson,  1990)  and  FLIR92  (Scott 
&  D’Angostino,  1992)  instantiate  this  asymptotic  term  as  the  product  of  P2  and  P3  and  use  the 
familiar TTPF as the limiting term:

P_2 \times P_3 = P_\infty = \frac{(N/N_{50})^{E}}{1 + (N/N_{50})^{E}}

in which N = number of cycles resolvable on the target,

N50 = N50 for detection, and
E = 2.7 + 0.7(N/N50)

Thus,  the  entire  ACQUIRE  probability  prediction  equation  can  be  expressed  simply  as  a  function 
of  time  and  the  number  of  resolvable  cycles  on  target,  which  itself  is  a  function  of  target  area  and 
contrast: 

P(t) = P_\infty \left(1 - e^{-t/\tau_{FOV}}\right)

The average target detection rate, 1/τ_FOV, is related to target information available and required
for 50% ensemble acquisition (Howe, 1993):

\frac{1}{\tau_{FOV}} = \frac{1}{6.8} \cdot \frac{N}{N_{50}}

The  theoretical  and  practical  shortcomings  of  this  model  are  discussed  in  various  sections  of  this 
report. 
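The time-dependent prediction described in this section, an asymptote given by the TTPF multiplied
by an exponential search term whose rate is N/(6.8 N50), can be sketched as follows. This is an
illustrative reading of the equations above, not the ACQUIRE code itself; the function names are
hypothetical, and time is assumed to be expressed in the same units as the constant 6.8 (taken
here to be seconds).

```python
import math


def ttpf(n, n50):
    """Asymptotic ensemble probability (P_inf) from the TTPF."""
    if n < 0 or n50 <= 0:
        raise ValueError("n must be non-negative and n50 must be positive")
    if n == 0:
        return 0.0
    ratio = n / n50
    exponent = 2.7 + 0.7 * ratio  # E = 2.7 + 0.7 (N / N50)
    if ratio <= 1.0:
        x = ratio ** exponent
        return x / (1.0 + x)
    return 1.0 / (1.0 + ratio ** (-exponent))


def acquire_probability(t, n, n50):
    """P(t) = P_inf * (1 - exp(-t / tau_fov)), with 1 / tau_fov = N / (6.8 * N50)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    p_inf = ttpf(n, n50)
    if p_inf == 0.0:
        return 0.0
    tau_fov = 6.8 * n50 / n  # assumed to be in seconds
    return p_inf * (1.0 - math.exp(-t / tau_fov))


if __name__ == "__main__":
    for t in (0, 5, 10, 30, 120):
        print(t, round(acquire_probability(t, n=2.0, n50=1.0), 4))
```

In this sketch both the asymptote and the search rate depend only on N/N50, which makes explicit
that the prediction is driven by a single scene-derived quantity plus elapsed time.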




5.4  Clutter  and  Its  Impact  on  Performance 

Counter  to  the  assumption  underlying  the  Johnson  criteria,  merely  having  a  certain  amount  of 
target-related information available in the scene does not completely determine performance.

The  background  in  which  the  target  is  present  must  also  be  taken  into  account  when  one  is 
making  predictions  of  performance.  The  term  “clutter”  has  no  single  agreed-upon  definition.  It 
has  been  described  as  scene  complexity,  number  or  density  of  target-like  elements,  number  or 
density  of  objects,  and  overall  scene  “busyness”  and  has  been  quantified  as  any  of  several 
unitless metrics (e.g., signal-to-clutter ratio [SCR]). What can generally be agreed upon is that
when certain kinds of terrain (such as desert) enable better target acquisition performance than
others (such as partially wooded terrain) when viewed optically, the driving force for this
difference is presumed to be that the former terrain is less cluttered than the latter. Exactly
what constitutes the clutter in a scene is not clear, although we can often judge it subjectively
“just by looking” at the scene.

Clutter  can  be  defined  either  locally  or  globally,  depending  on  the  metric  enlisted  to  describe  the 
scene.  As  stated  before,  certain  kinds  of  terrain  have  more  or  less  clutter,  in  general,  than  others. 
Likewise,  some  regions  within  a  given  scene  may  be  more  cluttered  than  other  regions.  This 
observation is obvious since terrain is rarely uniform and since some parts of a scene (such as an
open  field)  can  quickly  be  searched  while  rocks  or  trees  surrounding  the  field  provide  for  a  more 
difficult  search  situation.  Typically,  local  clutter  metrics  appear  in  models  of  time-dependent 
search,  while  global  clutter  metrics  appear  in  models  of  pure  acquisition  (when  eye  movements 
are  not  needed  because  target  location  is  known  ahead  of  time)11. 

Clutter is known to adversely affect target acquisition performance at several levels. The impact
of clutter on the Johnson criteria is to increase the number of resolvable cycles needed to acquire
the target (e.g., Mazz, 1998). The effect on search is to decrease the size of saccadic eye
movements between glimpses (meaning that the eccentricity from the fovea allowing for
effective search decreases), and to increase the amount of time spent at each glimpse location
(e.g., Akerman, 1992, 1993a). In addition and lending support to the definition of clutter as the
number  or  density  of  target-like  objects,  local  clutter  affects  where  eye  movements  will  occur. 
Fixations  tend  to  be  executed  to  “target-like”  regions  of  the  scene  and  not  to  locations  at  random. 
The  presence  of  many  target-like  objects  in  the  field  is  also  known  to  increase  the  false  detection 
probability  compared  to  when  there  is  relatively  little  clutter  (Schmieder  &  Weathersby,  1983). 

Clutter  is  discussed  in  more  detail  in  a  separate  section  of  this  report. 


11This is not a “hard-and-fast” rule, of course.

5.5 Target Acquisition Models Based on the Decomposition of the Scene Into Oriented
Spatial  Frequency  Channels 

Models  based  on  a  spatial  frequency  analysis  of  a  scene  assume  that  visual  perception  is 
mediated  by  an  array  of  spatially  tuned  pathways.  Each  pathway  responds  selectively  to  a  band 
of  spatial  frequencies  at  a  particular  orientation  and  located  at  a  particular  position  on  the  retina 
(i.e., corresponding to a particular position in the field of view). Information from these
channels forms the building blocks of all visual percepts, including, of course, those of the target.

Justification  for  modeling  the  visual  system  with  a  set  of  oriented  spatial  frequency  channels 
comes from a variety of sources. First, Hubel and Wiesel’s Nobel prize-winning research (e.g.,
1962, 1968) into the nature of cortical visual processing indicates that the receptive fields of
neurons in early visual cortex (V1 and V2) seem to be sensitive to the presence of oriented line
segments  but  largely  insensitive  to  the  presence  of  dots  of  light12.  Second,  the  shape  of  the 
human  contrast  sensitivity  function  (the  contrast  threshold  as  a  function  of  spatial  frequency)  and 
the  selective  adaptation  of  parts  of  the  function  can  be  explained  elegantly  by  the  summation  of 
overlapping  contrast  sensitivities  of  a  set  of  narrowly  selective  functions  that  varies  over  spatial 
frequency  (Campbell  &  Robson,  1968). 

In  order  to  model  a  visual  system  based  on  selective  sensitivity  to  spatial  frequency,  it  is 
necessary  to  determine  how  many  different  frequency-  and  orientation-selective  filters  are 
required  to  define  a  wide  variety  of  stimuli.  The  term  “channel”  is  used  to  describe  a  mechanism 
that  is  maximally  responsive  to  patterns  of  light  of  a  certain  spatial  frequency  and  orientation. 

Although  there  are  theoretically  180  degrees  of  orientation  and  about  three  logarithm  units  of 
spatial  frequency  to  which  humans  can  respond  within  any  orientation,  a  relatively  small  number 
of  channels  suffices  to  completely  describe  our  percepts.  Richards  and  Polit  (1974)  used  a 
metameric  texture-matching  task  to  determine  that  one-dimensional  textures  can  be  described 
completely  with  only  four  channels.  Metamers  are  two  stimuli  that  differ  physically  but  are 
perceived  to  be  identical  to  each  other.  The  existence  of  a  metamer  in  a  sensory  modality 
implies  that  either  the  receptors  in  that  modality  cannot  transduce  the  aspect  of  the  stimuli  that 
distinguish  them  or  the  nervous  system  cannot  encode  the  stimuli  as  being  different  from  each 
other.  Richards  and  Polit  found  that  any  two  textures  that  evoked  the  same  responses  along  these 
four  channels  were  perceived  to  be  identical,  regardless  of  their  actual  spatial  frequency  content. 
In two dimensions (expressed in polar coordinates), Wright and Jernigan (in Akerman, 1993a)
used a similar method to determine that 42 channels (6 radial and 7 theta oriented) completely
defined all the textures in their study. More pertinent to the modeling of the perception of
objects by spatial frequencies, Vol, Pavlovskaja, and Bondarko (1990) found that objects with
similar spatial frequency profiles tended to be more confusable than objects with disparate
spatial frequency profiles. This result indicates that, at some level, the visual system seems to
compute the multi-dimensional distance between combinations of spatial frequencies to evaluate
their similarity. In terms of target recognition, then, if the images of two targets do not differ
greatly in their spatial frequency signatures (e.g., an M60 and a T-72 tank viewed at a distance),
then they should be difficult to distinguish.


12That the neurons in V1 and V2 are highly sensitive to sharp edges is not inconsistent with a spatial frequency
interpretation of vision because such sharp edges approximate delta or step functions, which decompose into all
wavelengths by Fourier transform. Thus such an edge should, in theory, affect all properly oriented spatial
frequency channels.

That  a  relatively  small  number  of  channels  may  completely  determine  a  percept  means  that  a 
model  may  be  able  to  use  these  few  channels  as  a  set  of  feature  detectors  to  extract  perceptually 
important  information  from  the  scene.  Operations  can  then  be  performed  on  the  output  of  the 
channels in order to determine what the original image must have been to have precipitated the
activations13. 
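The idea that confusability follows the multi-dimensional distance between channel responses can
be sketched with a few lines of code. This is only an illustration: the Euclidean metric is an
assumption (the text does not say which distance the visual system computes), the function names
are hypothetical, and the four-channel response values are made-up numbers, not measured data.

```python
import math
from itertools import combinations


def profile_distance(a, b):
    """Euclidean distance between two channel-response vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of channels")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_confusable(profiles):
    """Return the pair of object names whose channel profiles are closest."""
    return min(
        combinations(profiles, 2),
        key=lambda pair: profile_distance(profiles[pair[0]], profiles[pair[1]]),
    )


# Hypothetical four-channel responses; the values are invented for illustration.
PROFILES = {
    "target_a": [0.9, 0.6, 0.3, 0.1],
    "target_b": [0.8, 0.7, 0.3, 0.1],
    "tree": [0.2, 0.3, 0.8, 0.9],
}

if __name__ == "__main__":
    print(most_confusable(PROFILES))
```

In this toy example the two targets have nearly identical profiles and are therefore predicted to
be the most confusable pair, mirroring the M60 and T-72 example in the text.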

Two classes of models have used the Fourier decomposition of scenes into constituent spatial
frequency information. One class of models performs the decomposition with the hope of
finding information within the spatial frequency representation of the scene that would come
from the Fourier decomposition of a target. The assumption of these models is that a given
target will have a spatial frequency profile that will stand out from that of the scene, and thus by
monitoring particular channels, a model can detect the target. Additionally, because fine spatial
detail resides at high spatial frequencies, the presence of such information may indicate that a
higher level of target acquisition may be possible. These models assume that the human visual
system  itself  may  be  monitoring  spatial  frequency  channels  when  it  searches  for  a  target. 

The  second  class  of  spatial  frequency  models  is  a  subset  of  more  general  purpose  human 
perception  models  that  uses  a  Fourier  decomposition  of  the  scene  as  a  “front  end”  for 
information  feeding  into  the  visual  system.  However,  this  second  class  of  models  then  uses  the 
information (in the form of channel strengths) as features, which are then combined into higher
order percepts such as junctions, surfaces, and solids. This class of models tends to be more
theoretically driven and typically comes from the realm of perceptual psychology. Examples
include  Wolfe’s  Guided  Search  3  (Wolfe  &  Gancarz,  1996)  and  Grossberg,  Mingolla,  and  Ross’s 
(1994)  model  of  surfaces,  edges,  and  attention. 


6.  Classic  Modeling  Concepts  Revisited 


Recent  work  in  modeling  has  either  augmented  or  attempted  to  replace  the  classic  concepts. 
Efforts  to  incorporate  new  factors  into  old  models,  and  challenges  to  the  underpinnings  of  the  old 
models  are  presented.  The  limitations  of  the  classic  concepts  are  discussed. 

13  An  interesting  perspective  of  what  the  visual  system  actually  does  comes  from  the  fact  that  the  brain’s  task  is  to 
try  to  determine  the  probability  of  an  object  being  in  the  visual  scene,  given  the  stimulation  along  the  visual 
pathway.  This  observation  is  often  overlooked  by  scientists  attempting  to  determine  the  visual  system’s  response  to 
a  stimulus.  The  two  things  are  the  opposite  conditional  probabilities  of  each  other  (P(stimulus|response)  versus 
P(response|stimulus)) and are in fact quite different (Rieke et al., 1997).

6.1 Contrast Revisited

Contrast,  like  the  various  metrics  proposed  as  alternatives  to  bar  cycles  on  target  from  the 
Johnson (1958) criteria, is a one-dimensional quantity. It is typically assumed to vary according
to the observer’s contrast sensitivity function, which gives the contrast required between an
object and its background (if both are uniform and untextured) for a target to be detected.
Determining the contrast threshold, Ct, for real-world situations requires taking into account
factors such as the reliability of detections (e.g., whether Ct is a 50% or 75% threshold), the
retinal eccentricity of the target, the size of the target, its shape if it differs greatly from a 1:1
height-to-width ratio, its hue, and the observer’s level of dark adaptation, to name a few.

The  concept  of  contrast,  as  a  single  quantity  indicating  to  a  large  degree  the  ease  with  which  a 
target  can  be  detected,  has  a  number  of  problems.  First,  it  fails  to  take  into  account  various 
psychophysical  findings  that  may  be  relevant  to  target  acquisition  performance  in  the  field.  For 
example, it is known that a non-uniform target against a uniform background is more detectable
than a uniform target against a uniform background (Akerman, 1992).

Second,  contrast  is  a  local  phenomenon  and  as  such,  cannot  address  issues  related  to  the  global 
scene  such  as  clutter  or  highly  salient  events  in  other  portions  of  the  visual  field.  It  is  known,  for 
example,  that  transient  events  in  the  periphery,  even  when  known  to  be  irrelevant,  can  render 
some  objects  difficult  to  detect  (O’Regan,  Rensink,  &  Clark,  1999).  In  these  cases,  the  contrast 
of  the  target  may  far  exceed  what  would  be  required  for  detection  in  the  absence  of  the  transient, 
yet  it  remains  undetectable14.  More  details  of  this  effect  from  perceptual  psychology  and  its 
possible  relevance  to  military  target  acquisition  are  discussed  next. 

Third,  the  flip  side  of  irrelevant  transients  reducing  the  effective  contrast  of  a  target  is  the  finding 
that  a  transient  occurring  at  the  target  location  or  motion  of  the  target  can  render  the  target  more 
visible than it would otherwise be (Mazz, Kistner, & Pibil, 1998; Nakayama & Mackeben,
1989). Search models that incorporate motion tend not to adjust contrast threshold downward,
however; they tend to change P1 to make it more likely that a target is localized in a single
glimpse15. This technique, of course, is empirically rather than theoretically motivated.

Fourth,  contrast  sensitivity  is  itself  dependent  on  temporal  aspects  of  the  scene  or  display  as  well 
as  light  adaptation  of  the  observer  and  retinal  eccentricity,  making  its  use  as  a  single  constant 
quantity  related  to  a  target  somewhat  questionable.  Few  Johnson  criteria-based  models 
incorporate this level of detail into their discussions of contrast. Models based on visual
physiology and psychophysics, however, are more likely to include these details in the model
front ends (see the upcoming section on psychophysical and physiological models).


14Studies in perceptual psychology that relate to transients and the transient capture of attention use the
unidimensional term “salience” rather than contrast. In the luminance domain, it may be argued that the terms may
be used interchangeably.

15In such models, motion may increase the size or characteristics of the hard or soft shell search lobe so that
targets of greater eccentricity from fixation are detectable. Though such a change is consistent with an increase in
target contrast, it is not specified as such in the models.

Fifth,  the  contrast  threshold  below  which  a  target  cannot  be  acquired  is  not  simply  a  function  of 
the  physical  stimulus  and  adaptive  state  of  the  observer.  Blackwell  (1958  in  Akerman,  1993a) 
lists  several  factors  and  how  threshold  contrast  should  be  adjusted  (always  increased)  to  account 
for  them.  His  results  are  summarized  in  table  3. 


Table 3. The effect of various factors on target detection contrast threshold (CT)

Factor                                                   Multiplier to CT
Uncertain frequency of occurrence (lack of vigilance)    1.19
Uncertain location                                       1.31
Uncertain occurrence                                     1.40
Uncertain size and occurrence                            1.50
Uncertain occurrence and duration                        1.60
Trained versus naive observers                           1.90-2.00
Non-foveal target location                               2.78

Note  that  all  these  factors,  with  the  possible  exception  of  the  last  one,  are  related  to 
psychological  variables.  That  such  factors  can  so  drastically  change  threshold  contrast,  yet  are 
not  included  in  models  or  are  accounted  for  by  appealing  to  a  group  of  “trained  military 
observers,”  indicates  a  lack  of  psychological  sophistication  and  a  clear  case  for  the  need  to 
investigate  how  psychological  factors  influence  performance. 
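
As a rough illustration of how the multipliers in table 3 might be applied, the following Python sketch scales a baseline contrast threshold by a single factor from the table. The baseline value, the dictionary keys, and the function names are hypothetical. The source does not state how multipliers combine when several factors apply at once, so only one factor is applied at a time.

```python
# Multipliers to the contrast threshold, transcribed from table 3
# (Blackwell, 1958, in Akerman, 1993a).
CT_MULTIPLIERS = {
    "uncertain_frequency": 1.19,
    "uncertain_location": 1.31,
    "uncertain_occurrence": 1.40,
    "uncertain_size_and_occurrence": 1.50,
    "uncertain_occurrence_and_duration": 1.60,
    "trained_vs_naive": 1.90,  # table gives a range of 1.90-2.00; lower bound used
    "non_foveal_location": 2.78,
}


def adjusted_threshold(base_ct: float, factor: str) -> float:
    """Scale a baseline contrast threshold by one table 3 multiplier."""
    return base_ct * CT_MULTIPLIERS[factor]


def is_above_threshold(target_contrast: float, base_ct: float, factor: str) -> bool:
    """True if the target contrast meets the adjusted threshold."""
    return target_contrast >= adjusted_threshold(base_ct, factor)


if __name__ == "__main__":
    base = 0.02  # hypothetical baseline threshold, for illustration only
    for name in CT_MULTIPLIERS:
        print(f"{name}: {adjusted_threshold(base, name):.4f}")
```

Under these assumptions, a target whose contrast comfortably exceeds the baseline threshold can fall below threshold once a single psychological factor is taken into account.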

6.2  Rethinking  the  Johnson  Criteria 

Although widely used and a good indicator of ensemble performance, the Johnson criteria are
not  without  their  problems.  It  is  instructive  to  recall  the  kind  of  stimuli  Johnson  used  in  his 
initial  study  (see  appendix  A  for  details  of  his  methods):  bar  patterns  of  uniform  contrast  against 
a  uniform  background.  Such  stimuli  are  obviously  unrealistic,  given  that  target  and  background 
characteristics  vary  greatly  in  the  field.  For  example,  using  the  results  of  Johnson's  study  to 
predict  detectability  of  targets  in  a  realistic  setting  requires  N50s  needed  for  various  levels  of 
acquisition  to  be  increased,  indicating  that  the  criteria  must  be  at  least  partly  determined  by 
particulars  of  the  situation.  Recall  also  that  clutter  is  known  to  increase  N50  across  the  board. 

6.2.1  Other  Issues  Related  to  the  Johnson  Criteria 

These  issues  hinge  on  a  more  realistic  representation  of  real-world  target  acquisition  situations. 

6.2.1.1  Non-uniform Information in Targets (i.e., targets with large regions of little detail)

This  problem  arises  from  our  attempting  to  apply  the  Johnson  criteria  to  a  wider  variety  of 
targets  than  were  considered  at  the  time  of  their  inception.  Regions  of  certain  targets,  such  as 
ships,  have  relatively  little  detail  and  thus  contribute  little  to  our  recognizing  or  identifying  the 
target. Other regions of the same target contain the critical details. Since the Johnson criteria
depend on the area and the cycles across a critical target dimension, it seems obvious that area of
the  target  alone  is  not  a  good  indication  of  the  information  therein  (Moser,  1972). 

Moser proposed that instead of using area and resolvable cycles to determine the information in a
target,  the  resolvable  perimeter  or  the  smallest  resolvable  perimeter  element  (i.e.,  a  convex  or 
concave  region)  would  be  a  better  indicator  of  performance.  Work  by  Kennedy  (1983)  has  led  to 
the  adoption  of  the  square  root  of  the  area  rather  than  simply  the  area  when  one  is  calculating  N 
as  a  partial  solution  to  the  difficulties  associated  with  using  raw  area.  Overington  (1982) 
suggested  a  similar  approach  to  how  recognition  should  be  modeled.  He  proposed  that  detection 
performance  (that  is  not  biased  by  aspect  ratio)  can  be  predicted  by  an  equivalent-size  disk 
detection  task,  and  that  identification  can  be  predicted  by  a  disk  discrimination  task  where  the 
size  of  the  disk  in  question  was  a  fraction  of  the  diameter  of  the  target.  Overington  incorporated 
the  psychophysical  function  relating  disk  discrimination  and  acquisition  performance  into  an 
early  version  of  the  ORACLE  model  (see  appendix  A  for  details  of  the  current  ORACLE  model). 

6.2.1.2  Anisotropic Targets (i.e., targets that appear vastly different when viewed from different
angles) 

It is plainly apparent that almost every target of interest is anisotropic. Johnson and Lawson
(1974) noted that many targets are more difficult to recognize from the front than from the side.
(For example, envision an M-2 Bradley and an M1 tank from the front and the side. There is
clearly  more  distinguishing  detail  available  from  a  side  view  of  the  vehicles.)  The  authors  found 
that  N50  for  recognition  of  ground  vehicles  increased  by  as  much  as  30%  when  viewed  from  the 
front.  At  intermediate  aspects,  however,  performance  remained  relatively  good  as  long  as  the 
details  visible  from  a  side  view  were  still  visible.  This  observation  is  very  similar  to  how  the 
RBC  theory  (Biederman,  1987)  postulates  that  humans  recognize  objects.  This  theory  is 
discussed  shortly. 

The  effect  of  aspect  has  also  been  demonstrated  to  interact  with  the  aspect  ratio  of  the  potential 
target.  The  increase  in  N50  as  a  function  of  aspect  is  even  more  pronounced  for  targets  that  have 
a  large  length-to-width  ratio,  such  as  a  ship.  In  this  situation,  N50  increased  by  as  much  as  500% 
from  the  side  to  the  front  view  (Johnson  &  Lawson,  1974;  Ratches  et  al.,  1973,  in  Howe,  1993). 
Thus,  the  Johnson  criteria  can  no  longer  be  considered  a  function  of  the  level  of  target 
acquisition  alone  but  must  also  incorporate  target  dimensions  and  aspect. 
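
The effect of aspect on N50 can be illustrated numerically. The following Python sketch assumes the empirical target transfer probability function commonly associated with ACQUIRE-type models, P = (N/N50)^E / [1 + (N/N50)^E] with E = 2.7 + 0.7(N/N50). That function is not given in this section and is used here only for illustration. The side-view N50 of 4.0 cycles and the function names are hypothetical.

```python
def ttpf(n_cycles: float, n50: float) -> float:
    """Ensemble probability of acquisition for n_cycles resolvable cycles.

    Assumed form: (N/N50)^E / (1 + (N/N50)^E), with E = 2.7 + 0.7 * (N/N50).
    """
    ratio = n_cycles / n50
    exponent = 2.7 + 0.7 * ratio
    powered = ratio ** exponent
    return powered / (1.0 + powered)


def n50_for_aspect(n50_side: float, percent_increase: float) -> float:
    """Raise the side-view N50 by a percentage to represent a harder aspect."""
    return n50_side * (1.0 + percent_increase / 100.0)


if __name__ == "__main__":
    n50_side = 4.0  # hypothetical side-view criterion, in cycles
    n50_front = n50_for_aspect(n50_side, 30.0)  # "as much as 30%" from the text
    for n in (2.0, 4.0, 6.0, 8.0):
        print(n, round(ttpf(n, n50_side), 3), round(ttpf(n, n50_front), 3))
```

By construction, the probability is 0.5 when N equals N50, so any increase in N50 for a frontal aspect lowers the predicted probability at a fixed number of resolvable cycles.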

A  different  way  to  characterize  target  information  within  a  Johnson-like  framework  (i.e.,  a  set  of 
criteria determining the amount of information required to acquire a target at different levels) is
to  use  metrics  other  than  N.  Several  such  metrics  have  been  defined  and  validated  that  do  not 
depend  explicitly  on  aspect.  These  metrics  are  said  to  have  been  validated  in  that  they  produce 
reliable  criteria  for  each  level  of  acquisition,  similar  to  the  Johnson  criteria  of  a  certain  N  for 
each level of acquisition.

As already mentioned, Moser has proposed that target information be a function of perimeter
while Overington (1982) proposed that a detectable or discriminable disk size be used.
Blumenthal and Campana (1981, 1983) proposed that image quality (operationally determined
by the function of the inverse of the size of a barely detectable circle or square) be a metric for
determining information about a target. Moser (1972) proposed an area-based metric (which he
subsequently questioned) in which information is a function of the number of pixels on a target
required for acquisition at various levels. Similarly, O’Neill (1974, in Howe, 1993) determined
that Moser’s number-of-pixels-on-target metric can be extended from silhouette images, used in
Moser’s study, to TV images.

A  recent  metric  proposed  by  Bijl  and  Valeton  (1998a)  involves  the  contrast  required  to 
discriminate  the  orientation  of  an  equilateral  triangle.  The  underlying  assumption  of  the  triangle 
orientation  discrimination  (TOD)  metric  is  that  if  a  subject  can  reliably  determine  the  orientation 
of  a  triangle  of  a  dimension  and  contrast  similar  to  a  target,  then  he  should  also  be  able  to 
discriminate  the  target.  The  critical  dimension  in  the  TOD  metric  is  the  square  root  of  its  area. 
That  is,  if  a  triangle  and  target  have  the  same  square  root  area,  the  probability  of  ensemble 
acquisition  should  vary  together  as  a  function  of  contrast. 

Bijl  and  Valeton  (1998b)  validated  the  TOD  metric  against  the  cycles-on-target  metric  in  the 
ACQUIRE  model.  ACQUIRE  is  used  to  predict  the  acquisition  range  for  targets  of  a  particular 
size  and  contrast.  By  comparing  data  about  the  discriminability  of  triangle  orientations  to  data 
related  to  cycles  on  target  and  detection  range,  the  authors  found  that  (a)  the  TOD  metric  was  a 
better  predictor  of  acquisition  range  than  ACQUIRE,  and  (b)  the  TOD  metric  is  less  susceptible 
to  the  aspect  of  targets,  including  ship  targets  known  to  have  a  large  effect  on  N50. 

6.2.1.3  The Reliance on a Single Quantity (e.g., cycles on target) to Determine Performance

One  problem  with  the  previously  mentioned  models  that  base  performance  predictions  on  the 
amount  of  information  that  can  be  derived  from  the  target  is  the  selection  of  a  single  aspect  of  the 
target  that  best  captures  the  information  content  of  the  target.  Area,  resolvable  cycles,  perimeter, 
equivalent disk, square, and triangle size all capture some aspect of the target’s information.
However,  it  is  likely  a  mistake  to  assume  that  all  observers  use  the  same  source  of  target 
information.  How  then  can  the  Johnson  criteria  be  made  to  use  more  information? 

As an example of a single metric that accounts for more than one aspect of a target, Akerman
and  Lucius  (1990)  defined  the  “useful  area”  as  a  function  of  both  perimeter  and  area.  Useful 
area  is  defined  as  the  portion  of  the  radius  inward  from  the  edge  of  an  object’s  perimeter,  which 
is  to  be  used  for  assessing  target  acquisition  performance.  This  metric  has  been  incorporated  into 
Akerman’s  visual  observer  model  (VOM)  (1992,  1993b).  This  technique  of  combining  two 
largely  independent  features  is  a  possible  solution  to  the  problem  of  selecting  the  one  dimension 
most  important  for  expressing  the  information  content  in  a  target.  Physiological  models  and 
newer fuzzy logic models (discussed later) use this property.

6.2.1.4  The Relative Importance of Some Features Compared to Others

The  concept  that  target  information  required  for  recognition  or  identification  is  related  to  the 
number  of  cycles  resolvable  on  a  target,  and  not  what  those  cycles  represent,  is  clearly  a 
generalization.  “More  information”  implies  that  some  of  it  will  likely  be  useful  for  discrimination 
performance,  although  the  nature  of  that  information  is  not  clear.  Johnson  and  Lawson’s  (1974) 
observation  that  N50  for  anisotropic  targets  reaches  a  relatively  stable  minimum  at  aspects  that 
include  portions  of  the  side  view  (e.g.,  a  front  left  aspect  angle)  indicates  that  as  soon  as  features 
of  an  object  are  visible  (and  themselves  discriminable,  of  course)  object  recognition  can  proceed 
relatively  independently  of  viewing  angle.  Thus,  there  may  be  critical  details  that,  once  visible, 
determine  performance.  This  may  be  particularly  true  for  targets  that  are  easily  confusable,  such 
as  a  T-62  and  T-72  tank.  In  a  case  such  as  this,  the  presence  or  absence  of  a  single  detail  may  be 
required  for  us  to  discriminate  between  the  two.  Should  such  a  detail  be  small,  the  Johnson  criteria 
for  the  discrimination  would  likely  be  quite  large  in  that  the  size  of  a  cycle  on  the  target  must  be  as 
small  as  the  critical  detail.  The  Johnson  criteria,  therefore,  may  be  predictive  but  not  very 
informative  of  the  information  that  the  observer  uses  to  make  a  decision. 

A  popular  model  from  perceptual  psychology  is  Biederman’s  (1987)  (see  appendix  A)  RBC 
theory,  which  states  that  recognition  of  objects  requires  details  (i.e.,  component  geometric  forms, 
called  “geons”  in  the  theory)  of  the  object  to  be  extractable  from  the  image.  If  the  aspect  of  the 
target  is  such  that  only  a  subset  of  the  geons  can  be  extracted  (because  others  are  not  visible),  then 
the object cannot be recognized definitively. In such cases, the observer uses the information
available and performs the highest level acquisition decision possible: a classification or a
recognition rather than an identification.

O’Kane,  Biederman,  Cooper,  and  Nystrom  (1997)  determined  that  the  confusability  between 
various  military  ground  and  air  vehicles  in  a  recognition  task  can  be  explained  by  an  RBC-type 
model.  The  authors  found  that  when  particular  features  were  obscured  or  not  visible  because  of 
viewing  angle,  observers  made  errors  in  a  manner  consistent  with  their  checking  an  internal 
representation  based  on  the  presence  and  configuration  of  basic  geometric  components  of  the 
objects. 

Marr  and  Hildreth  (1980)  and  Marr  (1982)  also  modeled  the  process  of  recognition  by  asserting 
that  objects  in  a  scene  are  decomposed  into  a  set  of  geometric  primitives.  Their  approach  was 
more  computationally  based  than  (and  used  a  different  mental  representation  of  objects  than)  that 
of  Biederman.  However,  a  common  fundamental  aspect  of  the  model  is  that  it  required  the 
image  to  contain  visual  information  sufficient  to  decompose  it  into  its  constituent  primitives  for 
recognition  to  take  place. 

Both  RBC  and  Marr’s  theories  differ  from  all  the  Johnson-like  metrics  and  models  in  that  the 
identity of constituent object components and not the quantity of information (however defined)
determines identification performance.

6.3  The Bailey (1970), the Classical, and the Neoclassical Search Frameworks

All  models  of  search  must  specify  three  aspects  of  the  dynamic  search  process:  search  lobe  type 
and  size,  fixation  location  selection,  and  whether  over-searching  is  permitted.  Search  models  all 
assume  that  a  fixation  must  occur  near  a  target  in  order  for  the  target  to  be  acquired.  The 
distance required for acquisition need not define a hard “cut-off” between detectability and
undetectability, however. The visual lobe is defined as a set of probability contours that map the
probability  of  acquiring  the  target  at  various  eccentricities  from  the  point  of  fixation.  The  shape 
of  the  function  can  be  a  step,  indicating  that  no  acquisition  can  occur  after  some  eccentricity  (and 
usually  that  there  is  equal  probability  of  acquisition  within  that  eccentricity)  or  a  continuous, 
decreasing  function  of  eccentricity.  Models  assuming  the  former  are  said  to  perform  “hard  shell” 
search;  models  assuming  the  latter  are  said  to  perform  a  “soft  shell”  search.  There  are  also  rare 
models  (e.g.,  Georgia  Tech  Vision,  discussed  later)  that  require  a  target  to  be  fixated  directly 
before  a  detection  can  be  made.  In  addition  to  how  close  to  a  target  a  fixation  must  fall,  search 
models  must  also  define  how  the  locations  of  fixations  are  generated.  Some  models  assume 
random  selection  with  replacement,  some  assume  random  selection  without  replacement,  and 
some  assume  guidance  to  target-like  regions  of  the  scene.  Finally,  models  must  also  specify 
whether  targets  can  be  fixated  more  than  once  without  being  detected  or  eliminated  from 
consideration. 
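
The distinction between hard shell and soft shell search lobes can be sketched in Python as two functions of eccentricity. The step function follows the description above. The Gaussian fall-off used for the soft shell is only one assumed example of a continuous, decreasing function; the parameter names and values are hypothetical.

```python
import math


def hard_shell(ecc_deg: float, radius_deg: float, p_inside: float = 1.0) -> float:
    """Step lobe: constant probability inside the radius, zero outside."""
    return p_inside if ecc_deg <= radius_deg else 0.0


def soft_shell(ecc_deg: float, scale_deg: float, p_peak: float = 1.0) -> float:
    """Continuous lobe: probability decreases smoothly with eccentricity.

    A Gaussian shape is assumed here purely for illustration.
    """
    return p_peak * math.exp(-0.5 * (ecc_deg / scale_deg) ** 2)


if __name__ == "__main__":
    for ecc in (0.0, 1.0, 2.0, 3.0, 4.0):
        print(ecc, hard_shell(ecc, radius_deg=2.0), round(soft_shell(ecc, scale_deg=2.0), 3))
```

A model that requires direct fixation of the target corresponds to the limiting case of a hard shell whose radius approaches zero.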

In  the  instantiation  of  the  Bailey  framework,  some  assumptions  must  be  made  regarding  how  the 
time-dependent  search  operation  is  conducted.  For  example,  selection  of  glimpse  locations  is 
typically  considered  to  be  random  sampling  with  or  without  replacement.  Also  inherent  in  the 
selection  of  glimpse  locations  is  the  selection  of  the  visual  lobe.  As  discussed  next,  scenes  will 
vary  greatly  as  to  the  location  of  eye  movements  and  distance  moved  in  terms  of  the  background 
and  anticipated  targets.  Glimpse  durations  are  usually  assumed  to  be  constant  and  independent 
of clutter, which is not necessarily the case. Clutter is known to increase dwell time (Akerman,
1992), indicating that P1 may actually depend on processes involved in P∞.

Two recent models that are based loosely on Bailey’s logic but include more factors known to be
involved with search performance are the Visual Detectability Model (VIDEM) (Akerman &
Kinzly, 1979) and VOM (Akerman, 1992, 1993b). The most notable additions to the Bailey
design are the effects of clutter (see appendix A and the section on clutter and conspicuity for
more details) and the (optional) effect of display noise (VOM version 1.2, Akerman, 1993b).
Display noise is represented by a final term, P4, the probability of discriminating a target that has
been fixated and detected, given the SNR inherent in the display upon which the target may be
presented to the observer. Therefore, in the final model, P = P1 × P2 × P3 × P4. It may be
instructive to note that the independence assumption makes the combination of P3 and P4
possible and that no other model has separated this last term. The VOM is interesting in that
although it uses a clutter metric (Waldman’s SCR, discussed later) to alter glimpse time, it still
uses a random selection of fixation locations.
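
The multiplicative form P = P1 × P2 × P3 × P4 can be written as a short Python sketch. It relies on the independence assumption noted above; the function name and the example values are hypothetical and are not taken from VOM.

```python
from math import prod


def acquisition_probability(p1: float, p2: float, p3: float, p4: float = 1.0) -> float:
    """Product of component probabilities, assuming they are independent.

    p4 defaults to 1.0 to represent a model without the display-noise term.
    """
    terms = (p1, p2, p3, p4)
    for p in terms:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return prod(terms)


if __name__ == "__main__":
    # Hypothetical component values, for illustration only.
    print(acquisition_probability(0.5, 0.8, 0.9))
    print(acquisition_probability(0.5, 0.8, 0.9, p4=0.5))
```

Because every term is at most 1, adding the display-noise term can only leave the predicted probability unchanged or reduce it.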
The Bailey search step, as defined by P1, the probability of fixating on the target in a single
glimpse, assumes that the duration of the fixation is sufficiently long to allow for
complete spatial summation of the stimulus. Spatial summation, which is only nearly complete for
relatively small stimuli, requires between 50 and 200 milliseconds to take place (Howe, 1993).
The time required to sum stimuli should certainly have an effect on the observer’s decision as to
the presence of a target, yet models tend to keep glimpse duration constant.

Self  (1969,  in  Akerman,  1993a)  summarized  five  aspects  of  eye  movements  in  real-world  visual 
search,  which  make  their  prediction  problematic: 

a.  When a target is not found quickly, the observer tends to re-search areas of the scene he thinks
are likely to contain the target while ignoring other areas of the scene which he thinks are
unlikely to contain the target. Although knowledge of the target and where it is likely to appear
could be helpful in many situations (and thus the justification for training the Soldier as to
common concealment/placement methods), such dependence on where a target ought to appear
could lead a Soldier to miss a target that is in an unexpected location.

This behavioral finding is in good agreement with a recent result by Chun and Wolfe (1996) that
shows that subjects use different criteria for rendering a target present/absent judgment: when a
target is located, search stops (as one may expect it to). When a target is not located, subjects
employ a “conservative quitting criterion” and will over-search the scene until a more restrictive,
task-dependent criterion for the target not being present is met.

The finding also indicates that cognitive processes related to knowledge of likely target
characteristics and capabilities and, possibly, familiarity with strategy and terrain types, have a
strong influence on performance. Presumably, there should be a strong effect of training on this
kind of behavior.

b.  Most subjects first perform a cursory scan of the scene for the target before beginning
any kind of systematic (trained or instructed) scan.

c.  Targets closer to the center of the FOV tend to be detected more rapidly than those in the
periphery.

This finding agrees with recent work on attention deployment in difficult (conjunction) search by
Carrasco, Evert, Chang, and Katz (1995). The authors showed that, all things being equal,
subject performance was faster and more accurate for detection of targets close to fixation.

Given that a subject will likely begin perusal of a scene somewhere near the center, these results
may be applicable to Self’s observations.

d.  Putting time pressure on the subject can lead to faster searching (i.e., shorter glimpse
duration) without a loss in accuracy.

e.  There are large, consistent individual differences between subjects related to
performance. Some subjects are consistently faster and more facile at searching than
others.

Although this point does not pose a specific problem for models based on Bailey, since these
models predict ensemble performance, it means that less of the variance within a study will be
captured by the situational variables of interest (e.g., N50).

In addition to Self’s observations, other researchers have observed two additional aspects of eye
movements that models must be able to address (e.g., Nicoll & Hsu’s, 1995, analysis of field
data from O’Kane, Walters, & D’Angostino, 1993):

f.  Observers routinely visit the target many times before declaring a detection of the target.

g.  Observers continue to visit non-targets and the target after detecting the target.

There  exists  substantial  evidence  that,  as  indicated  by  the  observations  by  Self  and  Nicoll  and 
Hsu,  eye  movements  are  anything  but  the  random-selection-with-replacement  phenomenon 
assumed  by  the  Bailey  model. 
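
To make the sampling assumptions concrete, the following Python sketch compares random selection of glimpse locations with and without replacement. It assumes a deliberately simplified hard shell case: the scene is divided into equally likely, non-overlapping cells, the target occupies exactly one cell, and it is always detected when its cell is fixated. These simplifications and the function names are illustrative assumptions, not part of the Bailey model as published.

```python
def p_found_with_replacement(n_cells: int, k_glimpses: int) -> float:
    """Probability the target cell has been fixated after k independent glimpses."""
    return 1.0 - (1.0 - 1.0 / n_cells) ** k_glimpses


def p_found_without_replacement(n_cells: int, k_glimpses: int) -> float:
    """Probability the target cell has been fixated when no cell is revisited."""
    return min(k_glimpses, n_cells) / n_cells


def expected_glimpses_with_replacement(n_cells: int) -> float:
    """Mean of a geometric distribution with success probability 1/n_cells."""
    return float(n_cells)


def expected_glimpses_without_replacement(n_cells: int) -> float:
    """Mean position of the target in a random ordering of the cells."""
    return (n_cells + 1) / 2.0


if __name__ == "__main__":
    n = 10
    for k in (1, 5, 10, 20):
        print(k, round(p_found_with_replacement(n, k), 3), p_found_without_replacement(n, k))
```

Under sampling with replacement, the target may remain unfixated even after as many glimpses as there are cells, which is one reason the choice of sampling assumption matters for predicted search times. Neither assumption reproduces the repeated, non-random revisits described in the observations above.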

Eye  movements  in  laboratory  studies  are  a  common  means  to  determine  if  a  model  provides  a 
good  fit  to  empirical  search  data.  A  largely  unaddressed  problem  for  such  a  validation  procedure 
is how to interpret brief glimpses of 100 to 200 milliseconds in duration. Such glimpses may be
corrections for erroneous saccades or may be genuine brief glimpses. At issue is what is considered a fixation
(Karsh  &  Breitenbach,  1983).  The  neoclassical  approach  to  search  (discussed  shortly)  attempts 
to  address  this  distinction  in  a  theoretically  meaningful  way. 

Eye movements are often considered nuisances in laboratory studies of perception because of
their unpredictability unless intentionally recorded [16]. Some methodologies require subjects to
perform  a  task  without  eye  movements.  However,  more  recent  models  from  perceptual 
psychology  have  attempted  to  incorporate  them,  since  there  is  now  a  fairly  solid  theoretical 
foundation  for  eye  movement  guidance  based  on  the  deployment  of  selective  visual  attention 
(Posner,  Snyder,  &  Davidson,  1980;  Schneider  &  Deubel,  1995;  McPeek,  Maljkovic,  & 
Nakayama,  1999).  What  has  become  obvious  to  vision  researchers,  long  after  it  was  widely 
known  to  target  acquisition  modelers,  is  that  eye  movements  do  not  agree  with  the  randomness 
implicit  in  many  basic  models  such  as  Bailey  and  ACQUIRE.  In  fact,  some  recent  evidence 
shows that eye movements are not performed randomly without replacement but
pseudo-randomly with replacement (Horowitz & Wolfe, 1998).


[16] Many such studies are interested in covert attentional shifts that do not require eye movements. Eye
movements in these studies are considered unwanted noise.

6.4  Models of Visual Search


Because of the evidence for a link (perhaps even an obligatory one; see McPeek et al., 1999)
between focal attention and eye movements, it would be beneficial to briefly review some recent
models of attention in visual search from the perceptual psychology literature. Models of
interest  include  Wolfe  and  colleagues’  Guided  Search  models  (Wolfe,  Cave,  &  Franzel,  1989; 
Wolfe,  1994;  Wolfe  &  Gancarz,  1996),  and  Humphreys  and  Muller’s  SERR  model  (1993).  All 
these  models  incorporate  stimulus-driven  and  goal-directed  selection  of  attention.  That  is, 
attention  may  be  drawn  to  salient  regions  of  the  scene,  or  it  may  be  directed  overtly  about  the 
scene  by  the  observer. 

Common  to  the  models  is  the  notion  of  a  pre-attentive  stage  of  processing  and  an  attentive  stage. 
Pre-attentive  processing  is  large  capacity,  parallel,  and  operates  over  much  of  the  visual  field. 
These  mechanisms  operate  on  the  level  of  the  features  that  constitute  objects  rather  than  objects 
themselves.  Focal  attentive  processing  is  small  capacity,  serial  or  limited  capacity  parallel,  and 
operates  on  objects  in  the  field  a  few  at  a  time.  Focal  attention,  with  or  without  overt  eye 
movements to the region of the scene, is assumed to be required for the proper binding of
features [17] into coherent objects (Treisman & Gelade, 1988) and for the conscious perception of
objects  (Rensink,  O’Regan,  &  Clark,  1997). 

The  various  versions  of  Guided  Search  all  consist  of  two  stages,  a  pre-attentive  stage  and  an 
attentive  stage  (see  appendix  A).  The  pre-attentive  stage  extracts  features  from  the  scene  along 
various  feature  dimensions  separately  (e.g.,  color  opponency,  orientation,  luminance,  motion). 
The  attentive  stage  uses  information  about  a  known  target  (if  one  is  available)  to  select  from 
regions  of  the  scene  that  weighed  highly  on  relevant  feature  dimensions  and  then  selects  a  single 
object to inspect. The interplay of top-down and bottom-up information is instantiated in the
model by a master activation map. Search progresses in a time-limited serial self-terminating
manner  (i.e.,  one  at  a  time  until  the  target  is  found,  all  items  have  been  searched,  or  a  temporal 
cut-off  has  been  met)  from  areas  of  high  activation  on  the  master  map  to  areas  of  lower 
activation.  The  first  two  versions  of  Guided  Search  do  not  incorporate  eye  movements. 
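
The master activation map and the serial self-terminating search described above can be sketched in Python as follows. The weighted sum used to combine bottom-up and top-down activation, the weights, and the function names are assumptions made for illustration; they are not the published Guided Search equations. The temporal cut-off is represented simply as a maximum number of inspected items.

```python
from typing import List, Optional, Sequence


def master_activation(bottom_up: Sequence[float], top_down: Sequence[float],
                      w_bu: float = 1.0, w_td: float = 1.0) -> List[float]:
    """Combine stimulus-driven and goal-directed activation per item (assumed weighted sum)."""
    return [w_bu * b + w_td * t for b, t in zip(bottom_up, top_down)]


def serial_search(activation: Sequence[float], target_index: Optional[int],
                  max_inspections: Optional[int] = None) -> Optional[int]:
    """Inspect items from highest to lowest activation.

    Returns the number of items inspected when the target is found, or None
    if all items (or the inspection limit) are exhausted without finding it.
    """
    order = sorted(range(len(activation)), key=lambda i: activation[i], reverse=True)
    limit = len(order) if max_inspections is None else min(max_inspections, len(order))
    for count, index in enumerate(order[:limit], start=1):
        if index == target_index:
            return count
    return None


if __name__ == "__main__":
    bottom_up = [0.2, 0.9, 0.4]  # hypothetical salience of three items
    top_down = [0.1, 0.0, 0.8]   # hypothetical similarity to the known target
    activation = master_activation(bottom_up, top_down)
    print(serial_search(activation, target_index=2))
```

In this sketch a target that matches the observer's expectations is inspected first even though another item is more salient, which is the kind of non-random ordering that random-glimpse models cannot represent.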

[17] It is important to note that the stimuli used in most laboratory studies of visual perception consisted of such
simple elements as oriented, colored line segments, rotated letters, and various shapes, usually presented on a blank
background. Though such a methodology allows for a discussion of the basic features comprising a simple object, it
may not immediately be generalized to studies of military target acquisition. Objects and scenes of military
significance cannot be reduced to basic features, at least not features analogous to those discussed in the perceptual
psychology literature. One could argue, though, that the various attempts to define metrics of target attractiveness,
conspicuity, and distinctiveness are attempts to find such a set of basic features to describe the real world.


Wolfe and Gancarz (1996) have recently modeled visual search with eye movements but with
fewer features than previous versions of the model. Guided Search 3.0 assumes that attention,
both stimulus-driven and goal-directed, creates a spatiotopic saccadic activation map
corresponding to the master activation map in earlier versions of Guided Search (see appendix A).
Maxima in the map represent the input to the saccadic control system, which then causes an eye
movement. Subsequent saccades to already searched locations are initially inhibited by
inhibition of return (IOR). As IOR fades over several hundred milliseconds, the activation of the
location can again increase until another saccade is produced. The model is quite simplistic to be
sure (e.g., it is concerned only with luminance and orientation), but its input from the scene and
the observer’s intentions allows it to predict nearly all the performance characteristics mentioned
by Self (1969).
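
The role of a fading IOR in producing revisits can be illustrated with a minimal Python sketch. It assumes that each saccade goes to the location with the highest net activation (activation minus inhibition), that the fixated location then receives a fixed amount of inhibition, and that all inhibition decays by a constant fraction per saccade. These update rules, the parameter values, and the function name are illustrative assumptions and are not the Guided Search 3.0 implementation.

```python
from typing import List, Sequence


def scan_path(activation: Sequence[float], n_saccades: int,
              ior_strength: float = 1.0, ior_decay: float = 0.5) -> List[int]:
    """Return the sequence of fixated locations under a decaying IOR."""
    inhibition = [0.0] * len(activation)
    path: List[int] = []
    for _ in range(n_saccades):
        net = [a - i for a, i in zip(activation, inhibition)]
        # Saccade to the current maximum of the net activation map.
        target = max(range(len(net)), key=net.__getitem__)
        path.append(target)
        # Existing inhibition fades, then the fixated location is inhibited.
        inhibition = [i * ior_decay for i in inhibition]
        inhibition[target] = ior_strength
    return path


if __name__ == "__main__":
    print(scan_path([1.0, 0.8, 0.2], n_saccades=5))
```

In this sketch the most active location is not fixated twice in a row, but it is fixated again once its inhibition has faded, which is qualitatively consistent with the repeated visits to a target reported in field data.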

Humphreys  and  Muller’s  (1993)  SERR  model  focuses  more  on  the  stimulus-driven  aspect  of 
search  than  does  Guided  Search.  The  factors  that  drive  the  ease  of  search  are  based  on  target- 
target,  target-non-target,  and  non-target-non-target  similarity  along  any  of  several  dimensions  on 
which  pre-attentive  vision  can  operate,  such  as  color,  orientation,  size,  etc.  (Duncan  &  Humphreys, 
1989).  Search  is  easy  if  targets  are  similar  to  each  other,  non-targets  are  similar  to  each  other,  and 
targets and non-targets are different from each other. As the degree of similarity within targets or
non-targets decreases, or the similarity between targets and non-targets increases, search becomes
more difficult. The model progresses through search by rejecting regions of the scene recursively
until  it  locates  the  target.  Rejection  is  based  on  features  dissimilar  to  the  target  and  similar  to  each 
other;  regions  containing  many  such  features  are  rejected  en  masse. 

What  is  clear  from  both  of  these  models  and  from  other  models  that  posit  a  pre-attentive  feature 
extraction  stage  followed  by  an  attentive  selection  stage  (e.g.,  Feature  Integration  Theory  by 
Treisman  &  Gelade,  1988,  and  Treisman  &  Sato,  1990),  is  that  locations  selected  for  attentional 
scrutiny  are  anything  but  random.  As  such,  models  that  posit  the  random,  independent  selection 
of  glimpse  locations  may  be  suspect  since  (a)  that  is  not  how  search  progresses,  and  (b)  the 
probability  that  a  target  will  be  selected  on  a  glimpse  is  a  decreasing  function  of  glimpse  location 
rather  than  being  constant  (i.e.,  it  is  dependent  rather  than  independent). 

Some  target  acquisition  models  do  indeed  predict  that  glimpse  locations  are  selected  from 
regions  of  the  image  that  are  likely  to  be  a  target  (i.e.,  that  contain  target-like  information, 
however construed). For example, the GTV (Doll, McWhorter, Wasilewski, & Schmieder, 1998)
model bases search on pre-attentively selected locations whose features are similar to those of the
(known) target. (GTV is described in more detail later and is detailed in appendix A.) Also, the
evaluation of numerous local clutter, distinctness, and conspicuity metrics is based on the
assumption that glimpses are directed to regions of the image that are relevant to the target.
(These metrics comprise a major section of this report and are discussed at length shortly.)

Both  models  from  perceptual  psychology  and  most  models  of  target  acquisition  assume  that 
over-searching  does  not  occur.  Given  the  observations  of  Self  (1969)  and  Nicoll  and  Hsu 
(1995),  this  assumption  is  obviously  false.  That  is,  there  are  cases  when  a  target  will  fall  within 
a  prescribed  search  lobe,  will  be  discarded  as  a  non-target,  and  will  be  inspected  later  and  at  that 
time  be  judged  a  target.  There  are  also  cases  when  no  target  is  present  and  the  observer  searches 
repeatedly  over  the  scene  before  rendering  a  no-target  judgment  (e.g.,  Chun  &  Wolfe,  1996). 
Models of search, as mentioned before, typically assume that once a target falls within a search
lobe, it is either found or not. (Models that incorporate random glimpse location with
replacement do not make this assumption; instead, however, they make another unrealistic
assumption about how search progresses.)

6.4.1  The  “Neoclassical”  Approach  to  Search 

Recall  the  assumptions  of  the  classical  search  framework  as  described  by  Bailey  and  how  they 
confront  the  reality  of  search  behavior:  the  classical  approach  is  serial  and  self-terminating, 
meaning that search progresses randomly one item at a time until the target is fixated, at which
time it is either detected or not. If it is detected, search halts. Self (1969) pointed out that search
does  not  progress  in  this  orderly  manner:  objects  are  not  selected  at  random,  objects  are 
searched  more  than  once,  and  objects  close  to  the  center  of  the  FOV  tend  to  be  searched  first. 

Though  a  pre-attentional  saccadic  guidance  stage  can  alleviate  some  of  these  difficulties,  such  a 
remedy  cannot  address  the  fact  that  in  the  real  world,  observers  search  the  same  object  more  than 
once.  The  violation  of  this  assumption  draws  into  question  the  assumption  that  search  can  be 
described  as  a  single  Poisson  process. 

Nicoll  and  colleagues  (e.g.,  Nicoll,  1994;  Nicoll  &  Hsu,  1995;  Cartier,  Nicoll,  &  Hsu,  1998) 
have  proposed  a  different  way  to  model  search  and  detection.  The  neoclassical  framework  is 
based  on  a  different  set  of  assumptions  about  how  an  observer  actively  goes  about  searching. 

The  phenomenal  underpinnings  of  the  model  are  similar  to  Yarbus’s  (1967)  description  of  eye 
movements:  “the  human  eye  can  only  be  in  one  of  two  states:  in  a  state  of  fixation  or  in  a  state 
of changing the point of fixation.” When one is searching for a target, Yarbus’s description can
be extended to three states: (1) fixating on the target, (2) fixating on a non-target, and
(3)  changing  the  point  of  fixation.  The  modelers  in  the  neoclassical  framework  describe  the  first 
two  states  as  “examining  points  of  interest  (POIs)”  and  the  third  state  as  “wandering.”  Search 
can  therefore  be  described  by  a  Markov  process  containing  these  states  and  the  rates  of 
transitions  between  them. 

Unlike  the  classical  framework  in  which  time  to  first  fixation  of  a  target  (and  thus  detection)  is 
an  exponential  function  of  total  search  time,  the  neoclassical  framework  assumes  that  detection 
of  a  target  is  an  exponential  function  of  time  spent  examining  the  target  itself,  not  search  overall. 
That  is,  a  certain  amount  of  time  must  be  spent  examining  the  target  POI  for  it  to  be  detected. 

In more detail, a scene contains i POIs, with POI(0) being defined as the target and POI(1)
through POI(i-1) defined as non-targets. Search can be in a state of examining any of these i
POIs. In addition, search can be in an intermediate state in which it is wandering (without
memory for where it has been) between POIs. This wandering state is referred to as W. The rate
at which target information is accumulated is defined as a(0).
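
The contrast between the two detection assumptions can be written compactly. This is a sketch only: the symbols τ (a mean acquisition time for the classical framework) and T_0 (the cumulative time spent examining the target POI) are notation assumed here for illustration, and any asymptotic ceiling on detection probability is ignored.

```latex
% Classical framework: detection depends on total search time t
P_d^{\text{classical}}(t) = 1 - e^{-t/\tau}

% Neoclassical framework: detection depends only on the time T_0
% accumulated while examining the target POI, at rate a(0)
P_d^{\text{neoclassical}}(T_0) = 1 - e^{-a(0)\,T_0}, \qquad T_0 \le t
```

Because T_0 grows only while the target POI is being examined, time spent wandering or examining non-targets adds to t without adding to the probability of detection.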

The  rates  describing  the  transitions  between  these  states  can  be  written  as  follows: 

w = average rate of observer leaving a POI to wander

Si = average rate of observer entering the ith POI from wandering

Ji = average rate of observer entering the ith POI from another POI

Markov  processes  are  tractable  mathematically  because  the  solution  for  the  behavior  of  the  entire 
system  is  the  linear  combination  of  exponential  terms  for  each  state.  Although  tractable,  such  a 
solution  may  become  very  complex,  since  the  number  of  potential  POIs  in  a  scene  can  be  quite 
large. However, the solution may be simplified dramatically once the meanings of the transition
rates are made clear. The rates of entering a POI, Si and Ji, can be thought of as a function of the
attractiveness  of  the  POI.  As  mentioned  before  (and  mentioned  later  in  this  report  when  clutter 
is  discussed),  there  are  a  number  of  ways  that  the  attractiveness  of  non-target  regions  can  be 
modeled.  The  output  of  some  pre-attentive  mechanism,  as  mentioned  before,  seems  to  play  a 
role  in  search  performance.  Various  local  metrics  for  conspicuity  and  clutter  have  also  been 
proposed.  All  these  metrics  and  processes  are  involved  in  the  designation  of  local  regions  of  the 
scene  that  contain  information  that  is  “target  like.” 

In addition to points of local clutter in a scene, there is also good evidence that global measures
and metrics of clutter influence performance (decrease Pd and/or increase response time) without
appealing to the detailed spatial information in the scene. Such overall metrics of clutter or
non-target scene attractiveness can be thought of as the rate at which any non-target POI is entered
from wandering. This global attractiveness assumption allows us to simplify the model
dramatically by lumping all the states wherein the eye is neither wandering nor examining the
target as a single state: examining a non-target POI. The solution then becomes the linear
combination of three states. The model can be further reduced into a two-state model if the
target is not considered to have a different attractiveness than non-targets.


Figure 2. The complete state description diagram for the neoclassical search of a target, POI(0),
among i-1 distinct non-target points of interest, POI(1) to POI(i-1).
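
The lumped three-state version of this process can be explored with a short simulation. The sketch below makes several assumptions that are not in the source: direct POI-to-POI transitions (the Ji rates) are ignored, the trial is stopped at detection so that a detection time can be measured (the Markov process itself has no such stopping rule), and all rate values in the usage example are hypothetical.

```python
import random


def simulate_detection_time(w, s_target, s_nontarget, a0, rng):
    """Return the time at which the target is detected in one simulated trial.

    States: "W" (wandering), "N" (examining any non-target POI) and
    "T" (examining the target POI). All dwell times are exponential,
    so the process is memory-less.
    """
    t = 0.0
    state = "W"
    while True:
        if state == "W":
            # Leave wandering for whichever POI class is entered first.
            total = s_target + s_nontarget
            t += rng.expovariate(total)
            state = "T" if rng.random() < s_target / total else "N"
        elif state == "N":
            # Examine a non-target, then return to wandering at rate w.
            t += rng.expovariate(w)
            state = "W"
        else:
            # On the target: detection (rate a0) competes with leaving (rate w).
            time_to_leave = rng.expovariate(w)
            time_to_detect = rng.expovariate(a0)
            if time_to_detect < time_to_leave:
                return t + time_to_detect
            t += time_to_leave
            state = "W"


def mean_detection_time(n_trials, seed=0, **rates):
    """Average detection time over n_trials independent simulated trials."""
    rng = random.Random(seed)
    total = sum(simulate_detection_time(rng=rng, **rates) for _ in range(n_trials))
    return total / n_trials


if __name__ == "__main__":
    sparse = mean_detection_time(2000, w=2.0, s_target=1.0, s_nontarget=1.0, a0=1.0)
    cluttered = mean_detection_time(2000, w=2.0, s_target=1.0, s_nontarget=5.0, a0=1.0)
    print(sparse, cluttered)
```

Raising `s_nontarget`, the stand-in for global non-target attractiveness, lengthens the mean time to detection even though the detection rate on the target itself is unchanged, which is the sense in which clutter acts on search rather than on detection in this framework.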

The neoclassical framework has the advantage of making falsifiable predictions about search
times, in that observer behavior should be the linear combination of exponential random
variables. Nicoll and Hsu (1995) used eye tracker data from a NVESD study (O’Kane, Walters,
& D’Angostino, 1995) to examine the specific predictions of the memory-less three-state
Markov search model. The predictions of the search portion of the model and the analysis of
results are as follows:

1.  Targets  are  not  always  detected  upon  first  visit.  The  probability  of  detection  on  a  visit  is 
independent  of  overall  time  spent  searching. 

The  first  statement  is  obviously  true.  The  second  statement  is  not  true;  there  is  a  weak 
correlation  between  time  searching  and  Pd  on  a  particular  visit.  The  authors  attribute  this  result 
to  the  non-exponential  character  of  visit  duration  during  detection  visits  (discussed  next). 

2.  A  memory-less  Markov  process  implies  that  the  searcher  will  return  to  the  target  after 
detection  (i.e.,  the  process  itself  does  not  include  a  termination-upon-detection  requirement 
as  was  assumed  in  Bailey  [1970]  and  other  classic  framework  models). 

Eye  movement  data  clearly  support  this  prediction. 

3.  The  duration  of  pre-detection  visits  to  a  target,  during  detection  visits  (when  detection 
actually  occurs),  and  post-detection  visits  should  all  be  equivalent  and  should  be 
exponentially  distributed. 

The pre- and post-detection visit durations are exponential and essentially identical. However,
the during-detection visit durations tend to be longer (in the case of the test data, nearly twice as
long) than the pre- and post-detection durations; there were few very short-duration visits, and the
distribution lacked a tail of long-duration visits. From these data, it seems that during-detection
visits are more nearly normally than exponentially distributed. The authors posit that this delay may
have been attributable to a motor response and some sort of inhibition in the eye movement
system. As discussed in the next section of this report, it could also be that a different strategy
was used for verification leading to a detection than for checking when no detection decision
was made.

4.  The  distribution  of  the  time  to  the  first  target  visit  is  described  by  one  or  two  exponentials 
(depending  on  whether  all  POIs  are  equivalent  or  target  and  non-target  POIs  are 
different). 

The  distribution  of  first  visit  times  is  actually  close  to  an  exponential  but  only  after  a  delay.  This 
result  is  consistent  with  observations  in  the  scene  perception  literature,  indicating  that  observers 
do  not  begin  immediately  searching  the  scene  when  it  appears.  When  an  observer  is  confronted 
by  a  new  scene,  he  first  spends  a  few  hundred  milliseconds  glancing  around  at  it  to  “get  his 
bearings” and extract the spatial layout or “gist” of the scene (Intraub, 1981)18.


18Upon reflection, this observation is obvious even within the logic of the neoclassical framework. Some visual
and possibly cognitive process has to extract scene information sufficient to delineate points of interest before the
search process as described by the model can begin.




5.  The distribution of gaps (times between visits to the target) is described by one or
two exponentials.

The  data  examined  indicate  that  a  two-exponent  process  provides  good  agreement  with  the  data. 

6.  A  memory-less  Markov  process  implies  that  the  gaps  before  and  after  detection  will  be 
distributed  in  the  same  way. 

After  detection,  the  gaps  are  not  distributed  exponentially.  The  search  process  returns  to  the 
target  too  soon  after  detection  for  it  not  to  have  learned  (i.e.,  search  is  not  a  memory-less 
process). 

The  detection  process  (i.e.,  the  assumption  that  detection  is  based  on  time  exploring  the  target 
and  not  search  time  overall)  makes  two  additional  predictions  within  the  framework  of  the 
Markov  process: 

7.  The  probability  of  detection  is  exponential  in  the  time  on  target. 

This  basic  premise  of  the  detection  process  is  supported  by  the  data. 

8.  The  distribution  of  the  number  of  targets  detected  (across  all  trials  in  the  data  set)  is 
described  by  two  or  three  exponentials. 

This  hypothesis,  too,  is  supported  by  the  data  when  the  search  time  is  shifted  to  account  for  the 
delay in first visit (see [4]), though a few finely grained anomalies remain. For targets with high
P∞, a one-exponential model and the classical framework both do well; for targets with low P∞, a
two-exponential model accounts for the data suitably, while the classical model’s predictions are
too low by a nearly constant amount.
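
One simple way to screen a set of visit durations or gap times against the exponential prediction is the coefficient of variation, which equals 1 for an exponential distribution and falls below 1 when very short and very long values are both rare. The sketch below is not the analysis used by Nicoll and Hsu (1995); the durations are synthetic, and the gamma-distributed sample is only a hypothetical stand-in for a peaked, non-exponential distribution.

```python
import random
from statistics import mean, pstdev


def coefficient_of_variation(durations):
    """Standard deviation divided by the mean; 1 for an exponential distribution."""
    return pstdev(durations) / mean(durations)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic durations in seconds (not data from the study).
    exponential_like = [rng.expovariate(2.0) for _ in range(5000)]
    peaked = [rng.gammavariate(4.0, 0.25) for _ in range(5000)]
    print(round(coefficient_of_variation(exponential_like), 2))
    print(round(coefficient_of_variation(peaked), 2))
```

A value well below 1 for a class of visits would be one indication that a single memory-less state does not describe that class.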

There  are  several  strengths  in  the  neoclassical  approach  to  search  and  detection.  First,  it 
provides  theoretical  rationale  for  known  eye  movement  phenomenology  such  as  searching  the 
target  more  than  once  and  continuing  to  search  after  detection  of  a  target.  Second,  the 
assumption  that  detection  depends  on  time  spent  examining  the  target  is  more  likely  an  accurate 
description  than  the  assumption  in  the  classical  framework  (that  detection  depends  on  time  spent 
searching  in  general).  Third,  the  notion  that  the  attractiveness  of  POIs  determines  the  rates  at 
which  their  states  are  entered  provides  a  way  to  insert  conspicuity,  attractiveness,  or  clutter,  at  a 
global  or  local  level,  into  a  theoretical  framework.  If  the  neoclassical  framework  proves  to  be  a 
better  predictor  of  overall  behavior  than  the  classical  framework,  then  the  assignment  of  rates  by 
attractiveness  may  permit  the  objective  analysis  of  such  metrics. 

Nicoll (1994) extended the basic model to include field of regard searches, multi-target searches,
searches when a particular state is assumed to begin the process (in accordance with the
observation by Carrasco et al. [1995] that targets near the center of the FOV tend to be examined
first), and time-limited searches. Not only can the framework accommodate such concepts, but it
still provides testable predictions.

A disadvantage is that the neoclassical model does not completely account for the data set
examined  in  the  Nicoll  and  Hsu  study.  The  time  constant  that  must  be  added  to  first  target  visit 
times,  the  fact  that  there  is  evidence  for  memory  of  detection  (by  the  post-detection  visit  gaps), 
and  the  non-exponential  distribution  of  during  detection  target  visit  durations  all  provide 
evidence  that  the  Markov  process  model  cannot  account  for  search  without  additional 
mechanisms. 

Perhaps the most glaring shortcoming of the model is its assumption of memory-less search.
Such an assumption negates the possibility of cognitive search strategies (e.g., systematic search
of the scene or deciding not to revisit a previously searched region), when it is obvious that
observers use such strategies to search! Of course, Markov model predictions are based on
distributions across trials, so unless subjects used similar, consistent search strategies, the model
would be unable to determine if its assumptions were incorrect. That is, if subjects used an
evenly  distributed  (in  space)  variety  of  search  strategies,  then  the  data  would  still,  by  chance 
alone,  show  an  exponential  distribution  of  detection  numbers,  gap  times,  etc.  Presumably, 
though,  the  data  would  not  fit  as  tightly  around  an  exponential  curve.  Once  again,  individual 
differences  are  relegated  to  the  error  term. 

6.4.2  What  is  Happening  During  Detection? 

Nicoll  and  Hsu’s  (1995)  finding  that  distributions  of  target  visit  durations  are  longer  and  less 
exponential  when  a  detection  is  made  than  before  or  after  a  detection  is  made  indicates  that  some 
other process is involved in detection. What is that process? It may be instructive to clarify
what the authors meant by a “visit” to a POI. Eye movements do not simply go to a
potential target, sit there, then fly to another point. (If that were the case, then no “wander”
points could be empirically determined.) Rather, eye movements tended to be of two types:
sequences of short (in distance) saccades around a small region, and one or two long saccades
between these sequences. The inflection points between two long saccades were defined as
“wander” points (they typically lasted only around 100 ms, a period likely too short to extract
much information [Cartier et al., 1998]). The sequences of short saccades around a region were
defined  as  “examination”  points  around  a  single  POI.  In  other  words,  detection  was  based  on  the 
accumulation  of  time  spent  making  saccades  and  extracting  information  from  a  region,  not 
fixating  directly  at  a  target. 

This distinction gives rise to the possibility that the process of detection of a target may actually
be a discrimination process in which the target must be discriminated from a non-specific
“non-target” class of scene elements19. Since the assumption of all these models is that the observer is
aware of what a target looks like (how else could the non-target POIs have been selected?),
perhaps the time examining the potential target POI is actually spent by a discrimination process.


19This  redefinition  of  detection  is,  of  course,  a  tautology.  However,  it  may  be  meaningful  in  the  context  of  a 
difficult  search. 

Stark and colleagues (e.g., Noton & Stark, 1991; Hacisalihzad, Stark, & Allen, 1992; Stark,
1993) proposed the scan path theory, positing that observer eye movements examine a potential
target for known features and then recognize or reject the object based on the concordance of
observed and expected features. The examination of a target for discrimination requires a
sequence of anticipatory saccades toward known points (corners) of a target. Unfortunately, the
scan path theory’s limitation to large, clearly defined, familiar objects in a particular orientation
makes it unsuitable for target acquisition modeling. A more complete description of a number of
Stark’s models is presented in Lind (1995).

What then is going on during detection that slows the search process? Some sort of
feature-matching process may be in play. Also, as Nicoll and Hsu (1995) postulated, there could be a
motoric  delay  that  slows  search  while  a  detection  decision  is  physically  rendered  (though  why 
that  should  change  the  distribution  from  an  exponential  is  unclear).  It  could  also  be  that  as  more 
information  accumulates  about  the  target,  the  more  processing  time  is  required  for  the  addition  of 
information  and  for  evaluations  of  that  information.  (The  actual  process  of  detection  of  a  target 
is  not  specified  in  this  model,  only  its  temporal  character.) 

6.5  Clutter  and  Its  Effects  on  Performance 

It  may  be  worth  mentioning  at  the  beginning  of  this  section  that  the  term  “clutter”  has  no  analog 
in  perceptual  psychology.  Perceptual  psychology  tends  to  view  a  scene  as  a  collection  of 
features  (e.g.,  Wolfe,  1998),  surfaces  (Nakayama  &  He,  1994),  or  oriented  visual  primitives 
extracted  by  early  cortical  mechanisms  (such  as  line  segments,  e.g.,  Grossberg,  1997).  However, 
one of the most consistent findings in the visual search literature is that response time increases
with display size (number of non-target distractors). As mentioned earlier, a non-target is only
considered to be a hindrance in search (i.e., is only considered to be clutter or to be a distractor in
the literal sense) if it cannot readily be eliminated from consideration because it is similar to the
target (Egeth, Virzi, & Garbart, 1984; Duncan & Humphreys, 1989). Just what it is about a
non-target that is important (e.g., the color, size, shape, orientation, proximity, depth, etc.) is unclear.

It  is  also  known  that  the  homogeneity  and  distribution  of  non-targets  influence  search  difficulty. 
Duncan  and  Humphreys  (1989)  found  that  search  performance  suffered  when  (a)  non-targets 
were  similar  to  targets,  and  (b)  when  the  non-targets  were  dissimilar  to  each  other.  Nothdurft 
(1991)  found  that  targets  were  easy  to  see  if  they  differed  from  their  neighbors  in  a  single 
feature,  but  the  same  targets  embedded  near  similar  features  were  quite  difficult  to  see.  Wolfe  et 
al.  (1989,  1994,  1998)  have  modeled  the  selection  of  locations  for  the  deployment  of  focal 
attention  and  eye  movements  as  a  function  of  similarity  as  well  as  distance  between  non-targets 
and  targets,  and  Humphreys  and  Muller  (1993)  have  modeled  the  elimination  of  non-targets 
based  on  these  feature-based  similarities. 

Taken together, perceptual psychology has a relatively simplified conceptualization of what
might be considered clutter. As such, the quest to find a single explanation of clutter and a single
numerical metric for its magnitude comes largely from work in the target acquisition and ATR
modeling communities20.

In this section, several metrics for clutter, conspicuity, and distinctness are discussed in terms of
what they measure, why or how they are purported to work, and how well they have fared at
predicting target acquisition performance. The terms clutter, conspicuity, and distinctness, plus
the  term  “attractiveness,”  are  all  attempts  to  define  what  it  means  for  a  target  to  be  easy  or 
difficult  to  acquire.  Clutter  may  be  considered  the  inverse  of  the  other  three  terms,  all  of  which 
(for  the  purposes  of  this  report)  are  used  interchangeably. 

Metrics  for  clutter  can  be  local,  semi-local,  or  global.  Local  metrics  refer  to  parts  of  a  scene  that 
are  confusable  with  the  target;  semi-local  metrics  refer  to  the  amount  of  clutter  in  particular 
regions  of  a  scene;  global  metrics  refer  to  the  overall  measure  of  scene  clutter  without  any 
specific  information  about  regions  or  locations  within  the  scene. 

6.5.1  Early  Clutter  Models/Metrics 

Clutter  and  conspicuity  have  long  been  included  in  models  of  target  acquisition.  As  mentioned 
earlier,  clutter  can  affect  search  processes  (by  slowing  search,  shrinking  a  hard  shell  lobe,  and 
influencing  eye  movements)  and  detection  and  discrimination  processes  (by  increasing  the 
amount  of  information  required  from  the  target  in  order  to  acquire  it).  Different  metrics  and 
models  of  clutter  have  therefore  been  inserted  into  models  at  different  stages  of  processing. 

By  far,  the  most  common  way  that  clutter  is  modeled  is  in  its  effect  on  detection.  It  is  important 
to note that a model predicting only Pd for an ensemble cannot determine in what stage of target
acquisition  (search,  detection,  recognition)  clutter  has  its  effect.  However,  if  a  local  clutter 
metric  can  predict  eye  movements  (e.g.,  Rotman,  Kowalczyk,  &  George,  1994;  Engel,  1977), 
then  such  a  distinction  may  be  made,  even  when  Pd  is  the  only  performance  measure.  For 
example,  if  eye  movements  reveal  that  high-clutter  scenes  contain  few  fixations  near  the  target, 
then  the  effect  of  clutter  was  on  the  search  process;  otherwise,  it  was  on  the  detection  process. 

Ryll  (1962)  modeled  the  effect  of  clutter  in  terms  of  the  probability  of  recognition  within  a 
fixation: 


1  +  v0.29t0'93  J 

in  which  M  =  the  number  of  “confusable  forms”  in  the  fixation  and 
t  =  the  single  glimpse  time. 

20It could be that the very term “clutter” with its negative connotation as a collection of undesirable things may
be traced to the fact that in target acquisition, clutter is defined to be negative, a non-target.

As M increases, recognition performance drops. However, Ryll’s instantiation of clutter may be
incompatible (by itself) with metrics that model clutter’s effect by increasing the average
glimpse time because as search slows, performance improves21. A question is, of course, what
factors determine whether an object in the scene is deemed confusable. In the original studies,
observers “eye-balled” the scene to make this determination. In a recent model (VOM,
Akerman, 1992, 1993b) the Ryll metric is incorporated with the number of confusable forms
determined empirically by means of Waldman's clutter metric Cn.


Bailey  (1970)  instantiated  the  effect  of  clutter  into  the  search  portion  of  his  model.  Clutter 
(defined  as  a  “scene  congestion  factor”  ranging  from  1  to  10)  influences  the  probability  that  a 
target  will  be  located  during  a  glimpse: 


$$P(t) = 1 - \exp\left[-\frac{700}{G}\,\frac{a_T}{A_s}\,t\right]$$

in which $a_T$ = target size,
$A_s$ = search area,
$G$ = scene congestion factor {1..10}, and
$t$ = search time.

6.5.2  Conspicuity,  Distinctness,  and  Attractiveness 

Williams  (1966)  was  the  first  to  insert  a  metric  for  target  conspicuity  into  a  target  acquisition 
model.  His  metric  relates  to  clutter's  effect  on  detection  probability  over  time: 

$$P_d = 1 - e^{-K_p t / A_d}$$

in which $K_p$ = target conspicuity,
$t$ = search time, and
$A_d$ = display area.

Williams’  Kp  concept  is  a  way  of  modeling  the  specific  effect  that  clutter  has  on  the  number  of 
fixations  required  to  locate  the  target.  Given  an  infinite  amount  of  time,  however,  target 
performance  will  be  perfect.  Williams  recognized  that  many  factors  would  contribute  to  a  single 
measure  of  the  conspicuity  of  a  target,  but  at  the  time,  only  psychophysical  data  and  sophisticated 
metrics  existed  to  describe  luminance  contrast. 

21An example of a model that includes clutter at several points in processing is the VIDEM model (Akerman &
Kinzly, 1979). Clutter was in so many places that Akerman removed some of its effects from his later VOM
(Akerman, 1993b). See appendix A for details of VIDEM and VOM.

Similarly to Bailey’s (1970) instantiation of clutter, Williams’ conspicuity metric slows search
but does not determine the probability of eventually detecting the target. Such an instantiation of
clutter requires an account for the known effects of clutter on detection and discrimination
performance elsewhere in the model.
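
The claim that conspicuity slows search without limiting eventual detection can be checked numerically. The sketch below reads Williams' expression as Pd = 1 - exp(-Kp t / Ad); the function names and all numerical values are hypothetical, and no particular units are implied.

```python
import math


def williams_pd(k_p, t, a_d):
    """Detection probability after search time t: Pd = 1 - exp(-Kp * t / Ad)."""
    return 1.0 - math.exp(-k_p * t / a_d)


def time_to_reach(pd_goal, k_p, a_d):
    """Search time needed to reach pd_goal (0 <= pd_goal < 1)."""
    return -a_d * math.log(1.0 - pd_goal) / k_p


if __name__ == "__main__":
    display_area = 100.0
    for k_p in (1.0, 0.5):
        # Halving conspicuity doubles the time to reach the same Pd.
        print(k_p, time_to_reach(0.9, k_p, display_area))
```

For any positive Kp the probability still approaches 1 as t grows, so a model that uses this form must represent clutter's effect on the asymptotic detection probability somewhere else.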

Pratt  (1991)  described  several  first  order  metrics  of  target  distinctiveness.  These  metrics  are 
based  on  various  first  order  statistics  of  the  gray-level  representation  of  the  scene.  The  metrics 
are  based  on  the  mean  and  standard  deviations  of  gray  levels  across  the  target  and  the  target’s 
local  background.  Note  that  the  various  metrics,  depending  on  how  the  background  is  defined, 
may  be  considered  local,  semi-local,  or  global  (see  appendix  B  for  expressions  and  details  of  the 
metrics). 

•  Absolute  average  intensity  difference, 

•  Root mean square (rms) intensity and target variance difference,

•  Adjusted rms intensity and target variance difference,

•  Absolute  mean  intensity  plus  absolute  mean  standard  deviation  (SD), 

•  Absolute  mean  intensity  plus  target  SD, 

•  The  Doyle  metric  (Copeland,  Trivedi,  &  McManamey,  1996), 

•  The  Doylemod  metric  (Copeland  et  al.,  1996), 

•  The  nrms  metric  (Moulden,  Kingdom,  &  Gatley,  1990;  Kosnik,  1995). 

First  order  metrics  do  not  relate  pixels  to  one  another  but  are  descriptors  of  the  regions  of  the 
image  in  which  the  target  and  background  exist.  They  lack  any  information  about  where 
different  levels  of  luminance  are  with  respect  to  each  other.  An  additional  class  of  first  order 
metrics  is  the  histogram  and  histogram  intersection  metrics.  They  are  discussed  later. 
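
The point that first order metrics ignore spatial arrangement can be demonstrated directly. The two functions below are assumed forms inferred from the metric names only (the report's own expressions are in appendix B and may differ), and the pixel values are invented for illustration.

```python
import random
from statistics import mean, pstdev


def abs_mean_difference(target, background):
    """Assumed form: |mean(target) - mean(background)| over gray levels."""
    return abs(mean(target) - mean(background))


def mean_plus_sd_difference(target, background):
    """Assumed form: absolute mean difference plus absolute SD difference."""
    return (abs(mean(target) - mean(background))
            + abs(pstdev(target) - pstdev(background)))


if __name__ == "__main__":
    target = [200, 210, 190, 205, 195, 200]      # hypothetical gray levels
    background = [100, 120, 80, 110, 90, 100]    # hypothetical gray levels
    shuffled = background[:]
    random.Random(0).shuffle(shuffled)           # rearrange the same pixels
    print(abs_mean_difference(target, background))
    print(abs_mean_difference(target, shuffled))
```

Both calls print the same value because rearranging the background pixels changes their spatial structure but not their mean or standard deviation.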

In  addition  to  the  previously  mentioned  first  order  metrics,  there  are  metrics  that  take  into 
account  the  spatial  structure  of  the  gray-level  images  rather  than  simply  the  distributions  across  a 
target  or  background  area.  These  metrics  are  referred  to  as  second  order  metrics.  Metrics  that 
take  into  account  structure  can  begin  to  address  issues  related  to  specific  information  within  a 
target,  which,  if  present  in  a  background,  will  lead  to  a  decrease  in  conspicuity  (and  thus  an 
increase  in  clutter). 

One such metric that has been used in clutter and conspicuity metrics (e.g., Waldman, Wooton,
Hobson, & Leutkemeyer, 1988; Rotman, Tidhar, & Kowalczyk, 1994; Tidhar et al., 1994;
Rotman, Kowalczyk, & George, 1994; Copeland & Trivedi, 1996, 1998) is the gray-level
co-occurrency matrix. This matrix represents, within an area of a pixelated image, the frequency of
one gray-level occurring in a specified linear spatial relationship with another gray-level. The
co-occurrency matrix, PΔ(i,j), is a G × G dimension matrix in which G is the number of gray-scale
levels in the image. It is defined by


$$P_\Delta(i,j) = \frac{1}{N}\sum_{k=1}^{N} f(x_k, x_{k+\Delta})$$

in which $(x_k, x_{k+\Delta})$ = a pair of pixels, separated by Δ, with gray-levels i and j;

i and j = gray-level values from 0 to a maximum, G;

Δ = a displacement vector, which is a function of the distance, δ, between the
pixels and the angle between them;

f = {1 if $x_k = i$ and $x_{k+\Delta} = j$, or 0 otherwise};

N = number of pixels in the area of the image.
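
The counting procedure behind the co-occurrency matrix can be sketched in a few lines. This is an illustration rather than the implementation used by any of the cited authors: the displacement is expressed here as a hypothetical column/row offset (`dx`, `dy`), and the counts are normalized by the number of pixel pairs that fall inside the image area (an assumption, so that the entries sum to 1) rather than by the pixel count N.

```python
def cooccurrence_matrix(image, levels, dx, dy):
    """Return a levels x levels matrix of relative co-occurrence frequencies.

    image  : list of rows, each a list of integer gray levels in [0, levels)
    dx, dy : displacement (columns, rows) from the first pixel to the second
    """
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    n_pairs = 0
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            # Count the pair only if the displaced pixel lies inside the area.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
                n_pairs += 1
    if n_pairs == 0:
        raise ValueError("displacement leaves no pixel pairs inside the image")
    return [[value / n_pairs for value in row] for row in counts]


if __name__ == "__main__":
    tiny = [[0, 0, 1],
            [0, 1, 1]]
    for row in cooccurrence_matrix(tiny, levels=2, dx=1, dy=0):
        print(row)
```

Each distinct displacement yields a different matrix, so a metric built on this structure must also specify which displacements (distances and angles) are used.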

Waldman et al. (1988) used the co-occurrency matrix to calculate a normalized clutter metric,
Cn, which has been used in Akerman’s VIDEM (Akerman & Kinzly, 1979) and VOM
(Akerman, 1992, 1993b). Cn represents the degree to which the background texture is similar
to the target in shape, size, and orientation. (See appendix B for the calculation of Cn.)

The normalized clutter measure is computationally demanding and makes some assumptions that
may not be realistic when one is dealing with naturalistic images. It is symmetrical in orientation
and size and assumes that as similarity between target and background texture elements decreases,
clutter decreases uniformly. That is, texture elements different in size from the target by some
amount will produce the same clutter (all other things being equal) regardless of whether the target
or texture element is larger. The same assumptions are made for orientation; there is no absolute
difference in orientation. These assumptions contradict a phenomenon from perceptual psychology
known as search asymmetry (Wolfe, Cave, & Franzel, 1989; Wolfe, 1994). Search asymmetry
occurs when the reversal of target and non-target features results in drastically easier or more
difficult searches. (For example, searching for a vertically oriented target among obliquely oriented
non-targets is much more difficult than searching for an obliquely oriented target among vertical
non-targets.)

Also,  the  Cn  metric  yields  zero  clutter  if  the  background  is  uniform,  regardless  of  the  structure  of 
the  target.  Such  a  result  is  obviously  overly  simplistic  and  points  to  a  limiting  case  to  which  the 
metric  may  or  may  not  decay  gracefully  as  background  uniformity  increases.  No  literature 
regarding  whether  such  gradual  decay  actually  occurs  has  been  found. 

Similar to the normalized clutter metric is another metric based on the gray-level co-occurrency
matrix: the texture-based image clutter (TIC) (Shirvaikar & Trivedi, 1992; see appendix B for
details). Like Cn, the TIC metric depends on the size of target and background elements.
However, unlike the linear weight given to transitions between gray levels as a function of the
magnitude of their difference in Cn, TIC squares the difference, thereby giving more emphasis to
larger disparities in luminance. According to the authors, TIC is only marginally better than Cn
at extracting the meaningful structural information from the co-occurrence matrix.

Co-occurrency matrices are calculated one per displacement vector, Δ. That is, an image has as
many co-occurrency matrices as there are positions between the target and background blocks.

In  order  to  overcome  this  inherent  specificity,  Copeland  and  Trivedi  (1996,  1998)  created  a 
metric  of  target  distinctness  based  on  the  average  co-occurrency  matrix  (ACE)  (see  appendix  B 
for  details).  This  matrix  is  used  in  determining  the  distinctness  of  two  patches  of  texture  of  a 
particular  size.  It  is  based  on  all  possible  displacement  vectors  in  the  texture  model.  In 
psychophysical  tests  involving  the  detectability  of  low-contrast  geometric  targets  embedded  in 
texture  noise,  the  ACE  metric  was  judged  more  accurate  than  either  a  first  order  Doyle  metric  or 
the  target  complexity  metric,  described  next  (Copeland  &  Trivedi,  1998). 

Schmieder  and  Weathersby  (1983)  attempted  to  quantify  the  global  clutter  in  an  image  by  using 
a  measure  of  statistical  variance,  SV  (see  appendix  B  for  details).  From  the  global  SV,  an  SCR 
is  calculated  on  the  basis  of  absolute  target  contrast.  SCR  is  then  used  rather  than  SNR  as  a 
predictor  of  detection  in  a  cluttered  scene. 

The premise underlying the SV metric is the notion that the visual system is interested in areas
of the scene with high gray-level variability. Unlike the second order metrics based on the
gray-level co-occurrence matrix, SV is not concerned with the structure of the target or the
background but only with its variance. As such, two perceptually different patterns could produce
identical SVs. The theoretical justification for using the variance of the gray levels rather than a
structure-based metric such as the co-occurrence matrix may have arisen as much from the lack
of computing power in the early 1980s as anything else.
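
The SV and SCR quantities discussed above can be sketched as follows. Appendix B is not reproduced in this section, so the form used here (the root mean square of gray-level variances over non-overlapping square blocks, and SCR as absolute target-background difference divided by SV) is an assumption, as are the function names and the block size handling.

```python
import math


def block_variance(block):
    """Population variance of the gray levels in one block."""
    values = [v for row in block for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def statistical_variance(image, block_size):
    """Assumed SV form: RMS of the per-block variances of the scene."""
    rows, cols = len(image), len(image[0])
    variances = []
    for r in range(0, rows - block_size + 1, block_size):
        for c in range(0, cols - block_size + 1, block_size):
            block = [row[c:c + block_size] for row in image[r:r + block_size]]
            variances.append(block_variance(block))
    return math.sqrt(sum(variances) / len(variances))


def signal_to_clutter(target_mean, background_mean, sv):
    """Assumed SCR form: absolute target contrast relative to SV."""
    if sv == 0:
        return math.inf  # uniform background: no clutter at all
    return abs(target_mean - background_mean) / sv


# Two visibly different layouts built from the same gray levels.
checkerboard = [[0, 10, 0, 10], [10, 0, 10, 0], [0, 10, 0, 10], [10, 0, 10, 0]]
two_halves = [[0, 0, 10, 10]] * 4
sv_a = statistical_variance(checkerboard, block_size=4)
sv_b = statistical_variance(two_halves, block_size=4)
```

In this toy case `sv_a` and `sv_b` are equal even though one pattern is a fine checkerboard and the other a single vertical edge, which is the structural blindness described in the text.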

Schmieder and Weathersby (1983) found an orderly relationship between N50 for detection and
the SCR,

N50 =

which was integrated into the Night Vision Model by Nichols and Paik (1993). The resulting
increase in correlation between predicted and recorded detection performance as a function of
clutter (from r² = 0.04 to r² = 0.64) was significant.

In evaluating SV and SCR in an urban environment, Cathcart, Doll, and Schmieder (1989) found
that the metric underestimated performance compared to “rural” clutter. Such a result indicates
that factors such as expectations and other sources of contextual scene information may be as
important as image variance in determining performance in some situations. Birkmire, Karsh,
Barnette, and Pillalamarri (1992) found that global SV was a poor predictor of overall search
time. When SV was calculated for blocks of a display, Rotman, Kowalczyk, and George (1994)
found that SV did not correlate highly with eye movements (i.e., fixations to regions of high
clutter) in search.

Based on two assertions (that most targets tend to be more symmetric than non-targets and that
the visual system is able to efficiently detect regions of high local symmetry), Reisfeld, Wolfson,
and Yeshurun (1995) proposed a semi-local or global metric for eight-axis (circular) symmetry,
CSs (see appendix B for details). Although the authors reported that the metric did a reasonable
job of predicting near-target fixations for aerial views of symmetric ground targets, Rotman et al.
(1994a) found that the circular symmetry did not perform well at predicting general human
fixation behavior in a naturalistic scene. (That the model arose from the discipline of machine
vision may indicate that it is better suited for locating man-made objects in general than for
predicting human search performance.)

Tidhar  et  al.’s  (1994)  probability  of  edge  (POE)  clutter  metric  is  founded  on  the  idea  that  high 
spatial  frequency  edge  information  is  important  for  the  detection  of  targets.  Related  to  this  idea 
is  the  finding  that  the  visual  system  seems  to  perform  edge  extraction  early  in  visual  processing, 
thereby  creating  a  representation  of  the  scene  from  which  objects  and  surfaces  can  be  readily 
extracted (Marr & Hildreth, 1980; Marr, 1982; Nakayama & He, 1994; Biederman, 1987).

Rather  than  extracting  complete  edges  and  treating  them  as  elementary  features,  however,  the 
POE  metric  (see  appendix  B  for  details)  quantifies  clutter  by  counting  the  number  of  edge  pixels 
in  sub-regions  of  the  scene.  Unlike  the  SV  metric,  in  which  sharp  edges  (i.e.,  regions  with  high 
luminance  gradients)  lead  to  a  higher  SV  magnitude,  POE  merely  counts  the  pixels.  Like  SV, 
however,  POE  relates  only  the  amount  of  something  rather  than  the  structure  of  the  image. 
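
A minimal sketch of the counting idea follows. The edge detector used here (a simple neighbor-difference threshold), the block layout, and all names are assumptions for illustration; the detector actually specified for POE is given in appendix B and is not reproduced here.

```python
def edge_pixel_counts(image, block_size, threshold):
    """Count edge pixels in each non-overlapping block of the image.

    Assumption: a pixel is an edge pixel when the absolute gray-level
    difference to its right or lower neighbor exceeds `threshold`.
    Returns a dict mapping each block's (top_row, left_col) to its count.
    """
    rows, cols = len(image), len(image[0])

    def is_edge(r, c):
        right = abs(image[r][c + 1] - image[r][c]) if c + 1 < cols else 0
        down = abs(image[r + 1][c] - image[r][c]) if r + 1 < rows else 0
        return max(right, down) > threshold

    counts = {}
    for r0 in range(0, rows, block_size):
        for c0 in range(0, cols, block_size):
            counts[(r0, c0)] = sum(
                is_edge(r, c)
                for r in range(r0, min(r0 + block_size, rows))
                for c in range(c0, min(c0 + block_size, cols))
            )
    return counts


# The same vertical edge at two very different strengths.
faint_edge = [[0, 0, 10, 10]] * 4
strong_edge = [[0, 0, 100, 100]] * 4
faint = edge_pixel_counts(faint_edge, block_size=2, threshold=5)
strong = edge_pixel_counts(strong_edge, block_size=2, threshold=5)
```

Once both edges exceed the threshold, `faint` and `strong` hold identical counts, whereas a variance-based measure such as SV would be larger for the stronger edge.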

Unfortunately, also like SV, the POE metric fails to accurately predict response time (Birkmire
et al., 1992) and fixation location during search (Rotman et al., 1994a). Presumably, a problem
with the metric is that although edges of objects lead to edge-defined pixels, edge-defined pixels
do not necessarily indicate the edges of real objects. Rotman, Hsu, Cohen, Shamay, and
Kowalczyk (1996) evaluated a co-occurrency matrix-based clutter metric and the POE metric.
The authors determined that the co-occurrency-based metric outperformed the POE metric in
predicting observer false alarm responses. The stimuli used in the Rotman et al. (1996) study
may have been biased more toward the co-occurrence matrix since they were “large targets,
possibly camouflaged, where the texture of the target region is of crucial importance” (p. 673).
As such, there may simply have been less information in an edge representation of the targets
than in their internal texture-like detail.

Rotman, Tidhar, and Kowalczyk (1994) introduced the peak signal (ΔT) metric to describe the
difference between average “temperatures” across clusters of pixels (though any intensity
measure such as luminance will also work). (See appendix B for details of how the metric is
calculated.) In averaging across the gray-scale image in order to form clusters, we must realize
that all fine structural detail in the scene will be smoothed. (One input to the calculation is the
minimum cluster size, and no group of pixels smaller than that size is permitted in the cluster
representation.) An interesting aspect of this metric compared to other second order metrics is
that it does not require knowledge of the target’s structure; it concerns only the gray-level map of
the image.

The authors found that the metric was a good indicator of human fixation performance in
naturalistic scenes. No other evaluations of the metric were found. Toet (1996) has called the
model too computationally expensive to be of practical use in his comparison of clutter and
conspicuity metrics.

Another metric that purports to extract meaningful information about a target from a description
of  edge-based  information  is  the  target  complexity  (TC)  metric  of  Tidhar  et  al.  (1994).  The 
metric  is  similar  to  the  POE,  but  it  adds  the  assumption  that  target  objects  will  have  more 
pronounced  edges  than  interior  details.  (Defeating  such  a  real-world  property  of  objects  is  one  of 
the  goals  of  cryptic  coloration  in  animals  and  camouflage  patterns  on  targets,  so  the  metric  has  a 
degree  of  face  validity.)  The  metric  is  based  on  the  cumulative  distribution  of  difference  of 
offset  Gaussians  (DOOG)-extracted  edge  points  on  the  target  and  its  immediate  surround.  (See 
appendix  B  for  the  rather  complex  description  of  this  metric.) 

Although Tidhar et al. (1994) determined that the metric did a reasonable job of predicting overall
detection  RT,  the  fact  that  the  metric  is  only  defined  for  a  target  and  its  immediate  surroundings 
(usually  taken  to  be  twice  the  height  and  width  of  the  target)  leads  to  problems.  For  example,  a 
target  with  a  uniform  local  background  will  result  in  a  measure  of  TC  indicating  a  very  simple 
search,  even  though  the  scene  may  contain  much  complexity  that  would  cause  performance  to  be 
quite  poor.  Grossman,  Hadar,  Rehavi,  and  Rotman  (1995)  used  TC  as  a  basis  for  calculating  a 
signal-to-noise  metric  (analogous  to  the  calculation  of  Schmieder  &  Weathersby’s  SCR)  in  order 
to  model  false  alarms  in  cluttered  environments.  The  authors  found  that  the  metric  was  as 
effective as either POE or SCR at predicting the trade-off between P(FA) and Pd. (That is,
they all made similar predictions for how subjects change their thresholds as clutter increases to
produce more false alarms.)

A second order metric that incorporates both the ability of contrast to drive search performance
and the fact that contrast as defined by a first order metric does not take into account the contrast
variations along the boundary of the target is the complex contrast metric, K (Lillesæter, 1993).
Instead of modeling contrast as a function of maximum or average absolute difference between
target and background regions, K includes a term for the integrated point-by-point contrast
around the perimeter of the target. (The metric is defined in appendix B.)
The U.S. Army Night Lab Static Performance Model for Thermal Viewing Systems (Skjervold,
1995) has incorporated this metric. Although the metric does not take into account target
structure, that omission may not be important for its inclusion in a detection model.

The  last  class  of  non-empirical  conspicuity  metrics  to  be  discussed  is  based  on  how  the  human 
visual  system  analyzes  the  scene  with  and  without  the  target.  These  metrics  will  produce 
estimates  of  how  distinct  an  observer  will  perceive  the  target  to  be  within  the  context  of  the 
scene;  they  do  not  estimate  the  conspicuity  of  the  target  alone.  The  basic  rationale  for  these 
models  is  that  although  the  visual  system  may  seem  to  pay  attention  to  such  things  as  complex 
contrast,  the  probability  of  edges  within  a  region,  the  distribution  of  light  and  dark  pixels,  etc., 
visual processing does not occur on a pixel-by-pixel basis. Visual processing begins with an
analysis  of  the  scene  akin  to  a  Fourier  analysis.  As  such,  metrics  should  work  from  that  point 
onward. 

Information about portions of a scene can also be characterized in terms of their gray-level
histograms (i.e., the rank-ordered gray value distribution of pixels in the portion of the image).
Such histograms can be normalized by dividing the count in each gray-level bin by the number of
pixels in the region, so that each bin holds the fraction of pixels that have that value. Image
regions that appear visually similar should have similar normalized histograms. Since the
normalized histogram is a first order metric and conveys no information about the internal
structure of a region of the image, the converse is not necessarily true; regions that have identical
histograms may have dramatically different appearances. Also not necessarily true, though
usually the case in reality, is that two image regions appearing visually different (e.g., containing
a target and not containing a target) will have different normalized histograms. Conspicuity
metrics based on the normalized histograms of images determine the degree of histogram overlap
by a logical intersection of target and background histograms. A greater degree of overlap
indicates less conspicuity (see appendix B).
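
The histogram overlap idea can be sketched as follows. Taking the intersection as the sum of bin-wise minima of two normalized histograms is an assumption made for this illustration (appendix B gives the definition used by the cited metrics), and the function names are hypothetical.

```python
def normalized_histogram(region, levels):
    """Fraction of the region's pixels falling at each gray level."""
    values = [v for row in region for v in row]
    counts = [0] * levels
    for v in values:
        counts[v] += 1
    return [count / len(values) for count in counts]


def histogram_intersection(hist_a, hist_b):
    """Assumed overlap measure: sum of bin-wise minima, bounded on [0, 1]."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


# Two regions that look nothing alike but share a gray-level distribution,
# and a third, uniformly bright region.
checkerboard = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
two_halves = [[0, 0, 1, 1]] * 4
bright_patch = [[1, 1, 1, 1]] * 4

same = histogram_intersection(
    normalized_histogram(checkerboard, 2), normalized_histogram(two_halves, 2))
partial = histogram_intersection(
    normalized_histogram(checkerboard, 2), normalized_histogram(bright_patch, 2))
```

Here `same` reaches the maximum overlap although the two regions differ in structure, while `partial` is lower because the bright patch contains no dark pixels.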

The Camaeleon model (Hecker, 1992; see appendix B) calculates normalized histograms not on
the raw gray-level representations of images but on images convolved with band-pass filters.
Regions of the scene are designated target and background, and after band-pass filtering,
normalized histograms are created for the local energy (based on chromatic or achromatic
contrast), spatial frequency, and orientation of each region. The degree of camouflage (analogous
to magnitude of clutter, or the inverse of conspicuity, but bounded on [0,1]) is defined as the product
of the intersections of all target and background histograms. The main shortcoming of this
metric, of course, is the fact that it is uninterested in structural details of the target, and thus may
judge a target to be well camouflaged when it is not!

Another detectability metric based on neurophysiology is Watson’s (1987) Cortex Transform
(see appendix B for details). This metric is based on a multi-channel, oriented spatial frequency
analysis of an image adjusted by a contrast sensitivity function. It is called the cortex transform
because it mimics the oriented edge detection of area 17 (V1) of visual cortex. Two images, one
of a scene containing the target and one without, are first converted to luminance contrast images
and then subjected to the cortex transform. The result of the transform is a four-dimensional
representation of the scene, with each of 20 or 24 channels (five or six frequencies at four
orientations each) weighted at every point (i.e., the four dimensions are x, y, frequency, and
orientation). The difference between the strengths of the target and no-target components is the
component’s contribution to the overall distinctness. Masking is implemented in the metric
when the distinctness component is reduced by a factor related to the component’s background
signal strength. The distinctness of the scenes is determined by the Minkowski sum of the
coefficients.
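
The pooling step described above can be sketched as follows. This is only an illustration of Minkowski summation with a masking term: the exponent `beta`, the divisive form of the masking factor, the `mask_gain` parameter, and the flattening of the four-dimensional representation into plain lists are all assumptions, not details of Watson's implementation.

```python
def distinctness(target_responses, background_responses, beta=4.0, mask_gain=1.0):
    """Minkowski sum of masked channel differences.

    target_responses / background_responses: channel coefficients for the
    scene with and without the target, flattened over x, y, frequency,
    and orientation (hypothetical layout).
    """
    total = 0.0
    for with_target, without_target in zip(target_responses, background_responses):
        difference = abs(with_target - without_target)
        # Assumed masking: a strong background signal in the same
        # component reduces that component's contribution.
        masked = difference / (1.0 + mask_gain * abs(without_target))
        total += masked ** beta
    return total ** (1.0 / beta)


# Without masking and with beta = 2, the sum reduces to Euclidean distance.
unmasked = distinctness([3.0, 4.0], [0.0, 0.0], beta=2.0, mask_gain=0.0)
# The same difference on a strong background contributes less.
masked = distinctness([4.0, 0.0], [1.0, 0.0], beta=1.0, mask_gain=1.0)
```

Setting `mask_gain` to zero removes the masking term, which in this sketch always yields a distinctness at least as large as the masked value.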

The cortex transform, when masking is implemented, has been shown to be a good predictor of
human  detection  performance  for  low-contrast  scenes  (Ahumada  &  Beard,  1996;  Rohaly, 
Ahumada,  &  Watson,  1997).  Without  the  masking  term,  performance  tends  to  be  overpredicted. 
The  cortex  transform  is  an  elegant  implementation  of  known  early  visual  physiology  and 
psychophysics  in  that  it  integrates  inter-channel  masking  and  known  contrast  sensitivity 
functions  and  is  based  on  human  and  animal  physiology.  However,  it  is  a  predictor  of  pure 
detection  in  static,  achromatic  scenes,  so  its  current  usefulness  is  limited. 

Also relying on the assumption that differences in individual oriented spatial frequency channels
constitute a distinctness metric from which the detectability of a target can be determined is the
Perceptual Distortion distinctness metric of Martinez-Baena et al. (Martinez-Baena,
Fdez-Valdivia, Garcia, & Fdez-Vidal, 1998; Martinez-Baena, Toet, Fdez-Vidal, Garrido, &
Rodriguez-Sanchez, 1998). Like the cortex transform, the metric involves a spatial frequency
decomposition.  However,  the  distinctness  metric  is  based  on  changes  registered  only  in  the  few 
channels  that  provide  the  principal  structural  components  of  the  image. 

The  image  is  first  decomposed  into  radial  spatial  frequencies  representing  distinct  structural 
components  of  the  image.  The  relative  contributions  of  each  band  (wavelength  and  orientation) 
to  the  overall  image  structure  are  computed,  and  the  principal  components  are  identified.  Then  a 
set  of  oriented  Gabor  filters  is  applied  to  the  image,  based  on  the  principal  components.  Finally, 
a  difference  metric  is  created  on  the  basis  of  a  combination  of  the  differences  of  filter  output  on 
the  images  containing  and  not  containing  a  target. 

The  metric  was  evaluated  against  a  set  of  field  images  taken  during  the  DISTAF  (distributed 
interactive  simulation,  search  and  target  acquisition  fidelity)  field  test  at  Ft.  Hunter  Liggett, 
California,  in  1995  (Toet,  Bijl,  Kooi,  &  Valeton,  1997;  see  reference  for  information  about 
acquiring  image  set)  in  which  nine  vehicles  were  deployed  at  various  locations.  Scenes  were 
digitized  still  photos.  To  evaluate  the  model,  the  authors  digitally  removed  the  target  from  each 
scene  and  applied  the  metric  to  the  images  with  and  without  the  target.  The  resulting  distinctness 
metric  was  then  compared  to  an  empirical  metric  of  distinctness  by  Toet  and  colleagues  (described 
next).  The  empirical  and  calculated  distinctness  correlated  highly  (r  =  0.81).  The  calculated 
metric  also  correlated  highly  with  response  time  to  detect  the  target  in  the  scenes  (r  =  0.82).  These 
results  indicate  that  the  distortion-based  metric  may  be  a  good  overall  indicator  of  what  subjects 
use  to  guide  their  search  for  a  target  in  a  static  scene.  Like  many  of  the  metrics  in  this  section,  the 
distortion-based  distinctness  metric  is  achromatic  and  concerns  only  static  scenes. 

6.5.3  An  Empirical  Measure  of  Conspicuity 

Toet  (1996)  and  Toet,  Kooi,  Bijl,  and  Valeton  (1998)  described  an  empirical  method  for 
determining  the  conspicuity  of  a  target  in  a  scene.  They  used  Engel’s  (1977)  operational 
definition  of  conspicuity  as  being  the  peripheral  area  around  the  center  of  the  visual  field  from 
which  specific  target  information  can  be  extracted  in  a  single  glimpse.  This  definition  is 
obviously  similar  to  the  concept  of  a  visual  lobe.  Toet  and  colleagues  define  detection 
conspicuity  and  identification  conspicuity  as  the  maximum  distance  between  the  target  and 
fixation  that  permits  the  respective  level  of  acquisition. 

Toet and colleagues assert that it requires only a small number of subjects to perform a
psychophysical experiment on a scene and its target in order to determine conspicuities
consistent across a large group of observers. The results of Toet et al. (1998) indicate that two
experienced subjects are able to determine conspicuity measures that accurately predict overall
search performance (response time to detect a target) for a group of observers viewing the same
stimuli. The agreement between conspicuity and response time is a good indication that the
measure may serve as an efficient and effective means of determining conspicuity.

Such  an  empirical  method  may  be  of  more  use  in  future  laboratory-based  investigations  of 
conspicuity  than  in  the  prediction  of  performance  for  scenes  encountered  in  real  time.  The  authors 
have in no way determined lawful or predictive relationships between characteristics of the scene
and  the  target  and  conspicuity  as  empirically  measured.  On  the  other  hand,  their  relatively  simple 
empirical  method  allows  accurate  measures  of  conspicuity  to  be  extracted  quickly,  thus  making  a 
factorial  investigation  of  scene  features  and  their  role  in  conspicuity  feasible. 

6.5.4  Other  Clutter  Issues 

Related to the idea that discrimination may require the extraction of specific target features is the
possibility that clutter is perceptually masking such target features. Legge and Foley (1980) and
Tolhurst and Barfield (1978) demonstrated that the contrast necessary for the detection of a sine
wave grating is elevated when it is accompanied by a nearby masking grating of a similar
frequency and orientation. Given that high spatial frequencies contain information about edges
and fine detail, background elements of similar frequency and orientation to target features may
make them less visible. Masking is difficult to measure since its 2-D characteristics are as yet
unknown (see Olacsi & Beaton, 1998). However, implementing masking into a spatial
frequency-based model of vision or target acquisition has been accomplished successfully in the
cortex transform.

Although clutter can dramatically influence performance, there are some visual events that are
known to “cut through” the clutter: visual transients and motion. These visual events have a
temporal character that is absent from static visual clutter. As discussed in another section of
this report, motion has long been known as a feature to which the human visual system can
readily attend. Kosnik (1995), in particular, found that search for a moving target was nearly as
easy when the target was viewed on naturalistic terrain as on a uniform background. Likewise,
transient  visual  events  as  used  in  laboratory  studies  are  not  only  easy  to  see  but  may  also  be 
effectively  impossible  to  ignore  (e.g.,  O’Regan,  Rensink,  &  Clark,  1999).  If  a  target  is  known  to 
be  associated  with  such  a  visual  event,  clutter  will  not  play  nearly  so  vital  a  role  in  acquisition. 

6.6  Models  and  Metrics  Based  on  Human  Visual  Physiology/Psychophysics 

Models based on human visual physiology and psychophysics focus their attention on how the
human visual system processes actual scene information, rather than on how overt performance
may be related to scene variables such as clutter. These models are of interest because their goal
is to predict human performance for any situation in which an observer attempts to acquire a
visual target. As such, a model should inherently be able to address such factors as sensor type,
number of targets, moving or stationary target, presence of obscurants, level of clutter, etc.
These factors are not of separate interest since a model should be able to compensate for them by
virtue of the fact that it is an accurate depiction of human visual processing and decision making.

For  this  report,  the  broad  class  of  these  models  can  be  described  as  lying  along  a  continuum  from 
psychophysical  but  non-physiological  all  the  way  to  highly  physiological  and  predictive  of 
psychophysics.  All  the  models  attempt  to  model  early  human  visual  properties.  However,  some 
go about it more by processing information in stages related to closed form expressions of
psychophysical performance or by appealing to psychometric functions to determine human
perception of stimuli. Others approach it by processing information based on stages corresponding
to the transformations that information in the visual system undergoes during vision. Neither style
is  necessarily  better  or  worse  than  the  other,  so  long  as  (a)  the  physiology  agrees  with  the 
psychophysics,  and  (b)  the  physiology  and/or  psychophysics  are  well  understood  enough  that  a 
broad  class  of  phenomena  can  be  modeled.  This  discussion  will  begin  with  highly  psychophysical 
models  and  move  to  more  physiological  models. 

6.6.1  The  British  Aerospace  ORACLE  Model 

The ORACLE model from British Aerospace (Overington, Brown, & Clare, 1977; Cooke,
Stanley, & Hinton, 1995) attempts to model search, detection, and discrimination performance
for  a  human  observer.  (See  appendix  A  for  details  of  the  model’s  operation.)  The  model  is 
based  more  on  known  psychophysics  than  on  the  physiology  underlying  the  psychophysics.  An 
important  note  about  ORACLE  is  that  it  is  modular  and  proprietary,  and  no  full  implementation 
of  all  the  modules  is  known  by  the  author  of  this  report  to  exist  outside  British  Aerospace.  The 
documentation  available  for  this  report  concerns  search,  detection,  discrimination,  and  clutter  in 
an  achromatic  image  only. 

ORACLE bases its predictions on the retinal image of the target and how the visual system
responds to the retinal image. The primary assumptions behind the model are (a) the edges of a
target are more important than the target’s total energy in determining detectability, (b)
discrimination is a function of the visual system’s ability to distinguish between two adjacent
features of a target, each of which is approximated to be half the target size, and (c) signal
strength must exceed a noise strength in order for a detection or discrimination to be made.

Much of the model’s detail is concerned with how the non-linearity of eye optics and the
modulation transfer function of the cornea determine the point spread function of the eye.
Images of known resolution, contrast, and sharpness are then subjected to this function and
retinal images are produced. The sum of the activity and the gradient of the responses of
adjacent photoreceptors constitute the basic signal of the target.

ORACLE  models  search  as  a  soft-shell  process.  Fixation  locations  are  selected  at  random  with 
replacement.  Glimpse  time  is  a  constant  1/3  second.  If  the  target  lies  within  the  soft  shell  lobe, 
then  acquisition  can  occur.  The  lobe  size  is  modeled  as  a  distribution  of  hard  shells  and  may 
change throughout a trial. The effect of clutter in the model is to influence the distribution of
lobe  sizes  to  favor  the  selection  of  smaller  shells.  That  is,  clutter  makes  search  less  efficient 
because  less  of  the  image  area  can  be  searched  at  a  time. 
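
The consequence of random fixation with replacement can be sketched with a simple calculation. The sketch assumes that each glimpse succeeds independently with a probability equal to the expected lobe area divided by the search field area, that the lobe is a circle, and that edge effects are ignored; these simplifications, the function names, and the example numbers are illustrative and are not taken from the ORACLE documentation.

```python
import math


def glimpse_probability(lobe_radii, lobe_weights, field_area):
    """Chance that a randomly placed fixation puts the target inside the lobe.

    lobe_radii / lobe_weights: a discrete distribution of hard-shell lobe
    radii standing in for the soft shell (hypothetical representation).
    """
    expected_lobe_area = sum(
        weight * math.pi * radius ** 2
        for radius, weight in zip(lobe_radii, lobe_weights)
    )
    return min(1.0, expected_lobe_area / field_area)


def cumulative_acquisition_probability(search_time_s, p_glimpse,
                                       glimpses_per_second=3.0):
    """Probability of at least one successful glimpse by `search_time_s`.

    Fixations are random with replacement, so glimpses are independent;
    three glimpses per second corresponds to a constant 1/3 s glimpse time.
    """
    n_glimpses = int(search_time_s * glimpses_per_second + 1e-9)
    return 1.0 - (1.0 - p_glimpse) ** n_glimpses


# Clutter shifts the weight toward the smaller shell (assumed weights).
p_clear = glimpse_probability([1.0, 2.0], [0.5, 0.5], field_area=100.0)
p_cluttered = glimpse_probability([1.0, 2.0], [0.9, 0.1], field_area=100.0)
after_10_s_clear = cumulative_acquisition_probability(10.0, p_clear)
after_10_s_cluttered = cumulative_acquisition_probability(10.0, p_cluttered)
```

Under these assumptions, favoring smaller shells lowers the per-glimpse probability and therefore the cumulative probability at any search time, which is the sense in which clutter makes search less efficient.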

An important aspect of ORACLE is its ability to equate its distinction definition to the Johnson
(1958) criteria embodied in so many other models. It does so by Fourier decomposition of a
Johnson-like bar pattern into a fundamental and several odd harmonics and by determining
whether ORACLE can distinguish between the component spatial frequencies at a given
resolution, contrast, and size.
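
The decomposition of a bar pattern into a fundamental and odd harmonics is the standard Fourier series of a square wave, which can be sketched as below. Treating the bar pattern as an ideal unit-amplitude square wave with period 2π is an assumption for illustration; how ORACLE then tests the components for distinguishability is not shown.

```python
import math


def harmonic_amplitude(order):
    """Amplitude of the sine component at `order` times the fundamental.

    For an ideal square wave only odd orders are present, with
    amplitude 4 / (pi * order).
    """
    return 4.0 / (math.pi * order) if order % 2 == 1 else 0.0


def bar_pattern_partial_sum(x, n_harmonics):
    """Approximate the square wave at phase x using the first
    `n_harmonics` odd components (orders 1, 3, 5, ...)."""
    return sum(
        harmonic_amplitude(2 * k + 1) * math.sin((2 * k + 1) * x)
        for k in range(n_harmonics)
    )


# Fundamental plus the third and fifth harmonics, at the center of a bar.
three_terms = bar_pattern_partial_sum(math.pi / 2, 3)
many_terms = bar_pattern_partial_sum(math.pi / 2, 2000)
```

As more odd harmonics are included, the partial sum at the center of a bar approaches the bar's amplitude, and the amplitudes fall off in proportion to the harmonic order, so the higher components carry progressively less contrast.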

Although  the  model  available  to  this  reviewer  did  not  incorporate  color,  Cooke  et  al.  (1995) 
mentioned  that  such  a  version  of  ORACLE  does  exist.  Its  implementation  is  based  on  color 
opponency  between  R  and  G  cones  only.  Although  it  is  unclear  how  such  an  implementation  of 
color  processing  can  be  a  reasonable  facsimile  of  the  human  visual  system,  the  model  seemed  to 
do  well  at  a  laboratory  color  distinctness  task.  The  visibility  (signal  strength  relative  to  clutter 
strength)  of  colored  shapes  on  a  colored  background  was  judged  by  the  model  to  correspond 
highly  with  human  judgment  of  the  conspicuity  of  the  same  colored  stimuli.  Insufficient  detail 
of  the  study  and  the  implementation  of  the  model  were  provided  to  evaluate  this  claim,  however. 

Though  the  model’s  various  steps  in  processing  the  image  from  the  outside  world  (e.g.,  display 
or  sensor  or  optical  device)  through  optics,  photoreceptor  anatomy  and  physiology,  adaptation 
and  luminance  effects,  and  contrast  sensitivity  functions  are  all  based  on  well-documented 
psychophysics,  the  model  as  a  whole  has  not  been  evaluated  against  what  Cooke  et  al.  consider  a 
set  of  images  sufficient  to  test  it  in  toto.  Some  caution  is  urged  before  such  an  evaluation, 
especially  at  the  limits  of  the  known  psychophysics.  Models  such  as  this  likely  become  less 
accurate  as  the  stimuli  on  which  they  are  based  approach  the  limits  of  the  psychophysical 
measurements  used  to  develop  the  models.  Overington  (1982)  pointed  out  that  models  based  on 
psychophysics  have  specific  “envelopes  of  usage”  where  their  predictions  are  accurate.  Outside 
such  envelopes,  error  propagates  from  step  to  step  in  calculation,  resulting  in  a  potentially 
dramatic  degradation  in  overall  performance. 

A  more  serious  shortcoming  of  the  model  is  that  its  firm  foundation  in  psychophysics  has  made 
the  integration  of  top-down  (i.e.,  observer)  factors  extremely  difficult.  Currently,  there  are  no 
such  factors  in  the  model,  probably  because  the  psychophysics  behind  the  effects  of  training, 
attention,  stress,  etc.,  often  involve  setting  a  decision  criterion  or  a  processing  speed  rather  than 
changing the shape of a psychometric function. Since there is no single objective set of data
relating observer variables to psychophysics, the authors have taken the conservative route and
omitted such factors entirely.

A  related  shortcoming  is  the  fact  that  the  model  processes  information  in  a  single  stream  from 
image  to  retina  to  signal  to  response.  There  is  no  operation  that  takes  into  account  goal-directed 
(top-down) or stimulus-driven (pre-attentive or bottom-up) information. A manifestation of this
shortcoming is in the assumption that fixation location is random, as opposed to guided by
interactions of low- and high-level processes (e.g., Wolfe, Cave, & Franzel, 1989; Wolfe, 1994;
Wolfe & Gancarz, 1996; Doll et al., 1998). The authors readily admit that this assumption is
unrealistic  and  that  eye  movements  tend  toward  target-like  portions  of  the  scene,  but  they  argue 
that  “...the  effort  in  modeling  an  equivalent  level  of  detail  is  far  greater  than  the  reward  for  many 
situations”  (Cooke  et  al.,  1995,  p.  167). 

Motion  is  incorporated  into  ORACLE  only  in  terms  of  looming  motion  (i.e.,  motion  directly 
toward  the  observer).  Such  motion  is  modeled  as  an  increase  in  target  contrast  and  size,  from 
which  an  increased  signal  will  occur.  However,  such  a  gradual  increase  in  signal  strength  may 
not  account  for  the  particular  salience  characteristic  of  such  stimuli. 

6.6.2  The  Georgia  Tech  Vision  (GTV)  Model 

The GTV model (Doll, McWhorter, Schmieder, & Wasilewski, 1995; Doll, McWhorter,
Wasilewski, & Schmieder, 1998) and its military counterpart, visual/electro-optical (VISEO;
Doll et al., 1997), are general purpose models of human vision that can be used to model
search and detection in dynamic, cluttered scenes. Because the models are intended to be true to
the human visual system, they are based more directly on human visual physiology than ORACLE is.
The optics of the eye, as well as the retina and cortical areas V1 (edge detection), V4 (color
processing), and MT (motion processing), are integrated into the model’s processing. The
physiology must, of course, produce the same psychophysical functions that underpin ORACLE.
However,  the  authors  chose  to  be  more  general  in  order  to  handle  situations  that  do  not  agree 
closely  with  existing  psychophysical  findings.  (The  model  is  detailed  in  appendix  A.)  Much 
detail  is  provided  in  the  text  of  this  report  because  GTV  comes  closest  (in  this  author’s  opinion) 
to  integrating  what  is  known  about  the  spatial  frequency  aspect  of  early  vision  with  what  is 
known  about  the  phenomenology  of  visual  search  and  attention. 

GTV is quite complex and incorporates many aspects of visual processing. The primary
processes of interest include a multi-channel, oriented SF model of feature extraction,
texture-based scene segregation into object-like “blobs,” and parallel pre-attentive and attentive
modules to calculate two probabilities for locations in the image: the probability that a blob will
be the target of fixation (Pfix) and the probability that, once fixated, the blob will be detected
(Pyes|fix). A neural network learning algorithm determines the features that are to be stressed in
determining these probabilities. Signal detection theory is then used to determine whether a blob
will be judged to be a target. Search proceeds by selecting locations in descending order of Pfix,
without replacement, and determining at each location whether a decision is to be made based on
Pyes|fix. Outcome measures of the model are Pd, P(FA), d', and RT.

In more detail, GTV consists of five modules: (a) a front end, (b) a pre-attentive module, (c) an
attentive module, (d) a selection and training module, and (e) a performance module. GTV is
similar to Wolfe et al.’s Guided Search model in that parallel pre-attentive processes extract
peripheral feature information that is used for eye movement guidance. Concurrent with
pre-attentive processing, an attentive process extracts foveal feature information that is used for
discriminating between clutter and a target.
•  Front end processing in GTV concerns retinal factors such as pigment bleaching, pupil size,
flicker,  and  transient  luminance  changes.  Color  information  is  converted  from  responses  of 
the  three  photoreceptors  to  responses  on  two  (R/G  and  B/Y)  opponent  process  pairs  and  an 
average  achromatic  cone  luminance  signal. 

•  The  pre-attentive  and  attentive  processing  modules  in  GTV  use  sets  of  filters  tuned  for 
peripheral  and  central  color,  temporal,  and  spatial  sensitivities  to  extract  features  (e.g., 
motion,  orientation,  and  spatial  frequency)  from  the  image.  Motion  information  in  the 
image  (sampled  at  30  Hz)  is  filtered  to  produce  a  scalar  local  motion  signal  and  integrated 
to  add  blur  to  the  image.  Each  module  has  a  pattern  perception  unit  that  decomposes  the 
temporally integrated spatial information into 24 frequency- and orientation-selective
channels.  More  spatial  information  comes  from  the  cone  luminance  than  the  color 
opponency,  in  agreement  with  psychophysics.  Interactions  between  the  channels  are 
simulated  to  incorporate  spatial  masking.  Finally,  a  second  order  texture  metric  is 
calculated  and  blobs  (regions  of  different  textures,  corresponding  presumably  to  object-like 
regions  of  the  image)  are  segregated  from  the  background.  Features  in  SF  and  orientation 
domain  are  assigned  to  the  blobs  for  their  region. 

•  The selection/learning module takes the feature loadings on the blobs from both the
pre-attention and attention blob maps and assigns weights to them, based on the state of a neural
network  that  has  been  trained  (or  not)  to  look  for  a  specific  target.  This  module  is  intended 
to  mimic  the  ability  of  a  human  to  improve  in  performance  of  a  task  that  is  initially  quite 
difficult  (i.e.,  to  switch  from  controlled,  conscious  processing  of  sensory  information  to 
automatic  processing  [Schneider,  Dumais,  &  Shiffrin,  1984]). 

•  The performance module determines blob Pfix and Pyes|fix, and simulates a search process to
determine Pd, P(FA), d', and RT for a trial. Pfix for each blob is based on a noisy decision
process that takes into account the weights on the relevant features of blobs as well as noise
(quantum and neural for near-threshold stimuli), clutter (defined as “the extent to which
another blob’s luminance, texture, chromatic information, and temporal contrast match the
current blob”), and the spacing of other blobs nearby.

Pyes|fix and RT are determined by first calculating the SCR for each blob in the image. The SCR is
taken to be equivalent to an effective d', which in turn determines Pyes|fix for a blob. Assuming
that search progresses without replacement from highest Pfix to lowest and that search occurs at a
constant rate, search for a trial can then be modeled. The RT to a decision (either a false alarm
or a detection) is determined by the number of blobs that are encountered before a decision is
made.
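
The search process just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the GTV implementation: the mapping from an effective d' to Pyes|fix through a cumulative normal with a fixed criterion, the 0.3-second fixation time, and all names (Blob, p_yes_given_fix, simulate_search) are hypothetical choices made for this example. Rather than drawing random trials, the sketch computes the probabilities of each outcome directly.

```python
import math
from dataclasses import dataclass


@dataclass
class Blob:
    name: str
    p_fix: float      # probability that the blob attracts a fixation
    d_prime: float    # effective d' (taken as equivalent to the blob's SCR)
    is_target: bool


def p_yes_given_fix(d_prime, criterion=1.0):
    """Assumed mapping: cumulative normal of (d' - criterion)."""
    return 0.5 * (1.0 + math.erf((d_prime - criterion) / math.sqrt(2.0)))


def simulate_search(blobs, fixation_time_s=0.3, criterion=1.0):
    """Visit blobs once each, from highest to lowest p_fix, at a constant rate."""
    order = sorted(blobs, key=lambda b: b.p_fix, reverse=True)
    p_undecided = 1.0
    p_detect = p_false_alarm = rt_sum = 0.0
    for rank, blob in enumerate(order, start=1):
        p_yes = p_yes_given_fix(blob.d_prime, criterion)
        p_decide_here = p_undecided * p_yes
        if blob.is_target:
            p_detect += p_decide_here
        else:
            p_false_alarm += p_decide_here
        # RT grows with the number of blobs encountered before the decision.
        rt_sum += p_decide_here * rank * fixation_time_s
        p_undecided *= 1.0 - p_yes
    p_decision = p_detect + p_false_alarm
    return {
        "p_detect": p_detect,
        "p_false_alarm": p_false_alarm,
        "p_no_decision": p_undecided,
        "mean_rt_s": rt_sum / p_decision if p_decision > 0 else None,
    }


# Hypothetical two-blob scene: one target and one clutter object.
result = simulate_search([
    Blob("bush", p_fix=0.2, d_prime=1.0, is_target=False),
    Blob("tank", p_fix=0.8, d_prime=1.0, is_target=True),
])
```

Because every "yes" decision requires that a blob first be fixated, the sketch also makes concrete why response time in such a scheme depends on a blob's rank in the fixation order.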

6.6.2.1  Comments

The model is interesting in that it incorporates many principles of human physiology and
perceptual psychology. However, there are serious issues related to learning and to motion
processing. Learning is assumed to be the selection of features and combinations of features
indicating the possible presence of a target among clutter. The processing, especially in the
pre-attentive module, is meant to mimic the function of learning a task so well that it can be done
“without thinking” (i.e., automatically [Schneider et al., 1984]). After sufficient training, GTV
can perform even quite difficult searches with ease. The problem with the implementation of
learning is that any combination of features can be learned pre-attentively, a phenomenon that
cannot occur in humans. (For example, performance in a rotated T/L discrimination task will
never become automatic even after tens of thousands of trials [Wolfe, 1998].) Some features
cannot be processed in parallel pre-attentively but require focal attention (Wolfe & Bennett,
1997; Rensink et al., 1997). The authors acknowledge that after training, noise needed to be
added to the input images in order for the model not to outperform humans (Doll et al., 1998).

Motion  is  included  in  the  model.  However,  the  temporal  filtering  only  adds  a  scalar  motion 
feature  to  blobs  in  the  image.  Because  motion  information  is  scalar  (only  related  to  speed,  not 
direction),  the  model’s  attention  mechanism  has  no  direction  selectivity  as  the  human  visual 
system  has.  Therefore,  GTV  can  only  distinguish  between  speeds.  This  does  not  allow  the 
system  to  extract  information  about  motion  parallax  (e.g.,  how  a  moving  target’s  violation  of 
parallax  may  be  plainly  visible). 

Other,  more  minor  issues  relate  to  assumptions  made  about  when  the  model  calculates  some 
quantities and how it operates to make a decision. The calculation of all foveal features at the
same time (by the attention module) is not physiologically realistic. The model would be more
realistic and behave identically if it were to calculate the foveal features only after a blob is
selected by the performance module. (This behavior would take into account the unbound feature
aspect of pre-attentive vision described by Wolfe & Bennett, 1997.) Also, foveation of a target is required
for  a  “yes”  decision  to  be  made.  Even  though  the  model  is  ostensibly  based  on  the  conspicuity 
of  targets,  highly  conspicuous  targets  must  still  be  fixated  for  the  model  to  produce  a  “yes” 
response.  This  result  is  inconsistent  with  “pop-out”  (i.e.,  rapid  search  largely  insensitive  to  the 
number  of  distracting  elements). 

6.6.3  The  Wilson  (1991)  Spatial  Vision  Model 

The  basic  assumption  underlying  Wilson’s  (1991)  model  is  that  at  the  detection  and  identification 
threshold,  information  from  only  a  small  number  of  spatial  channels  that  are  most  sensitive  to  the 
target  determines  performance.  This  assumption  makes  intuitive  sense  since  a  signal  in  the  visual 
system  from  the  target  will  naturally  be  carried  by  those  channels  most  responsive  to  the  target. 
The  interesting  aspect  of  the  theory  comes  from  the  idea  that  decisions  are  based  on  the  output  of 
these few most active channels. The model is based on results from human and non-human
primate psychophysical and physiological experiments, indicating that the spatial tuning of six
mechanisms characterizes the behavior of the primate retino-geniculate-cortical (V1) pathway.

The six mechanisms correspond to different spatial frequencies. Lower frequency mechanisms
corresponding to coarser grain details are selective to fewer orientations; higher spatial frequency
mechanisms are sensitive to a greater number of orientations. The locations on the retina that
correspond  to  the  different  mechanisms  also  differ,  with  higher  frequency  mechanisms  at  smaller 
eccentricities  than  lower  frequency  mechanisms.  In  addition,  the  filters  have  different  contrast 
sensitivities,  consistent  with  the  contrast  sensitivity  functions  of  humans.  (See  appendix  A  for  a 
table  describing  the  filters  comprising  each  purported  mechanism.) 

Although  the  Wilson  model  of  spatial  vision  is  general  purpose  in  nature,  it  does  have 
implications  for  the  thresholds  required  for  the  detection  and  identification  of  targets  in  real- 
world  scenes.  The  model  assumes  that  the  degree  to  which  a  target  can  be  acquired  depends  on 
the  response  of  the  six  spatial  mechanisms  to  the  target  image.  More  to  the  point,  the  model 
assumes  that  a  few  highly  selective  filters  are  the  ones  that  determine  the  detectability  of  the 
target.  If  two  different  targets  stimulate  these  basis  channels  identically,  then  they  will  be 
identified  as  the  same  target,  and  discrimination  between  them  will  not  be  possible.  In  fact, 
additional information at other spatial frequencies will not permit discrimination because the
information  is  not  present  in  the  filter  responses  that  go  into  the  decision. 
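
The decision rule implied by this assumption can be sketched in Python as follows. The sketch assumes that each target is summarized by one response value per spatial mechanism, that the basis channels are the three most active ones (the number used by Thomas and Barsalou, 1995), and that a fixed difference threshold decides discriminability; the threshold value and function names are hypothetical and are not taken from Wilson (1991).

```python
def basis_channels(responses, k=3):
    """Indices of the k most active channels for one target."""
    ranked = sorted(range(len(responses)), key=lambda i: responses[i], reverse=True)
    return set(ranked[:k])


def discriminable(resp_a, resp_b, k=3, threshold=0.05):
    """Two targets are discriminable only if they differ on a basis channel.

    Differences confined to non-basis channels are ignored, because those
    responses never reach the decision stage in this scheme.
    """
    channels = basis_channels(resp_a, k) | basis_channels(resp_b, k)
    return any(abs(resp_a[i] - resp_b[i]) > threshold for i in channels)


# Hypothetical responses of six mechanisms to three targets.
target_a = [0.10, 0.90, 0.80, 0.70, 0.20, 0.05]
target_b = [0.40, 0.90, 0.80, 0.70, 0.00, 0.30]  # differs off-basis only
target_c = [0.10, 0.90, 0.30, 0.70, 0.20, 0.05]  # differs on a basis channel

same_ab = not discriminable(target_a, target_b)
diff_ac = discriminable(target_a, target_c)
```

In this example, targets A and B differ substantially on three channels yet are treated as the same target, because none of those channels is among the most active.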

In  order  to  test  Wilson’s  spatial  model,  Thomas  and  Barsalou  (1995)  determined  whether  a  target 
with  sufficient  contrast  to  be  barely  detectable  or  identifiable  will  be  perceived  differently  if 
information  is  added  to  non-basis  filter  channels.  The  authors  used  images  of  B-1B  bombers  and 
analyzed  them  with  a  set  of  filters  corresponding  to  Wilson’s  model.  The  three  most  active 
channels  were  identified  and  a  new  image  consisting  only  of  information  on  these  channels  was 
created.  Subjects  judged  the  two  images  as  identical,  indicating  that  the  decisions  seemed  to  be 
based  on  these  channels  alone. 

MIRAGE  (Watt  &  Morgan,  1985)  and  MIDAAS  (Kingdom  &  Moulden,  1992)  are  not  models 
of  target  acquisition  per  se  but  are  models  of  how  physiological  processes  can  extract  meaningful 
feature information from a scene. Both models concern one-dimensional stimuli only. The image
is  sampled  at  all  locations  at  four  different  spatial  scales.  The  output  of  the  filters  at  the  different 
scales  can  only  be  interpreted  as  an  edge  or  a  bar.  The  central  difference  between  the  two  models 
lies  in  how  the  information  from  the  different  spatial  scales  is  combined.  In  MIRAGE,  the 
responses  of  all  the  filters  are  combined  before  they  are  interpreted;  in  MIDAAS,  the  filters  are 
first  interpreted,  and  then  their  interpretations  are  combined  across  scales.  The  scale  dependence 
of  MIDAAS  is  viewed  by  the  authors  as  an  asset  since  it  provides  for  more  than  one  possible 
interpretation  of  the  scene. 

6.6.4  The  Limits  of  Direct  Access  Spatial  Frequency  Models 

Models  such  as  Wilson’s  (1991)  Spatial  Vision  Model  assume  that  detection  and  discrimination 
decisions  are  based  on  output  from  a  single  set  of  tuned  pathways.  In  such  models,  the  only 
difference  between  detection  and  discrimination  arises  from  how  information  from  those 
pathways  is  used.  Models  based  on  this  assumption  (rather  than  an  assumption  that  different 
basic operations provide information to detection and discrimination stages) are referred to as
“direct access multi-channel models” (Olzak & Thomas, 1992). The authors examined four
assumptions  inherent  in  this  class  of  models  in  terms  of  discrimination  performance: 

1.  The observer has direct access to the output of the channels.

2.  The  observer  can  selectively  attend  to  a  subset  of  these  channels. 

3.  The  pathways  are  independent  of  one  another.  (Mathematically,  they  are  independent 
Fourier  components.) 

4.  Information  from  the  pathways  is  integrated  probabilistically  in  order  to  determine  the 
presence  or  absence  of  information  in  the  image  based  on  the  channel  activations. 

Unfortunately,  these  assumptions  do  not  withstand  scrutiny  well.  Olzak  and  Thomas  (1992) 
demonstrated  that  the  channels  were  not  independent  by  cueing  one  channel  and  measuring 
effects  in  other  channels.  Verghese  and  Pelli  (1994)  and  Lamb  and  Yund  (1996)  found  that 
observers  are  quite  poor  at  selecting  a  scale  bandwidth  to  attend  to  and  search,  indicating  that  at 
least consciously, selection of individual channels is limited. There is some evidence that lateral
masking of spatial frequencies can occur and that it is not restricted to within-channel
frequencies (Ackerman, 1993a). Finally, Thomas and Olzak (1990) found that integration of
disparate  bandwidths  was  worse  than  integration  of  similar  bandwidths. 

Similar to the Wilson (1991) model is the physiologically based saliency model of Itti and Koch
(2000). The underlying premise of the model is that an observer directs his or her gaze at the
most visually salient location in the currently visible retinal image. Performance in the model is
based on eye movements to successive points of high salience in a scene, with this saliency
represented as a spatiotopic map of the scene.

The Itti and Koch (2000) model determines the saliency of locations of the retinal image through
a multi-feature, multiple scale scheme based on known visual psychophysiology and
psychophysics. The extraction of early visual features takes place at nine spatial scales for each
of three features: luminance intensity, color, and orientation (at four orientations: 0, 45, 90, and
135 degrees). Extraction at each location is performed by simulated center-surround
excitation-inhibition regions akin to the known physiology of early cortical visual processing.
Each set of multiple scale feature maps creates one feature conspicuity map by means of
competition between areas of high activation within each feature. This competition takes the form
of large spatial scale inhibition corresponding to the behavior of so-called non-classical receptive
fields present in visual cortex (Gilbert et al., 1996). The three conspicuity maps are then combined
into a single saliency map by means of a linear combination, the relative weights of which are
determined empirically, based on model performance, and then fixed as constant.
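
The final combination step can be sketched in Python as a weighted sum of the three conspicuity maps. The maps are represented here as small nested lists, and the weights are arbitrary illustrative values; neither is taken from Itti and Koch (2000), and the within-feature competition that produces each conspicuity map is not modeled.

```python
def combine_conspicuity_maps(maps, weights):
    """Linearly combine equally sized 2-D maps into one saliency map."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[r][c] for m, w in zip(maps, weights)) for c in range(cols)]
        for r in range(rows)
    ]


# Hypothetical 2 x 2 conspicuity maps for intensity, color, and orientation.
intensity = [[0.0, 1.0], [0.0, 0.0]]
color = [[0.0, 0.0], [2.0, 0.0]]
orientation = [[0.0, 0.0], [0.0, 4.0]]

saliency = combine_conspicuity_maps(
    [intensity, color, orientation],
    weights=[0.5, 0.25, 0.25],  # assumed weights, fixed once chosen
)
```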

The model posits a “winner-take-all” process so that the next fixation location is determined by
the location of highest activation in the saliency map. After simulated saccade selection takes
place, the area of highest saliency is temporarily inhibited (for approximately 500 to 900 ms) so
that it is not immediately selected as the next fixation location. This inhibition instantiates the
previously mentioned IOR effect widely demonstrated in perceptual psychology.
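
A minimal Python sketch of winner-take-all selection with IOR follows. It assumes a saliency map given as a nested list and expresses the duration of inhibition as a number of subsequent fixations rather than in milliseconds; that simplification, the inhibition radius, and the function name are assumptions made for illustration and do not reproduce the Itti and Koch (2000) implementation.

```python
def simulate_scanpath(saliency, n_fixations, ior_radius=0, ior_fixations=2):
    """Return successive fixation locations as (row, column) pairs.

    Each fixation goes to the most salient location that is not currently
    inhibited; the selected neighborhood is then inhibited for the next
    ior_fixations fixations.
    """
    rows, cols = len(saliency), len(saliency[0])
    inhibited_until = {}  # (row, col) -> first fixation index at which it is free
    path = []
    for t in range(n_fixations):
        best, best_value = None, float("-inf")
        for r in range(rows):
            for c in range(cols):
                if inhibited_until.get((r, c), 0) > t:
                    continue  # still under inhibition of return
                if saliency[r][c] > best_value:
                    best, best_value = (r, c), saliency[r][c]
        if best is None:
            break  # every location is inhibited
        path.append(best)
        r0, c0 = best
        for r in range(max(0, r0 - ior_radius), min(rows, r0 + ior_radius + 1)):
            for c in range(max(0, c0 - ior_radius), min(cols, c0 + ior_radius + 1)):
                inhibited_until[(r, c)] = t + 1 + ior_fixations
    return path


# Hypothetical one-row saliency map with two strong peaks.
path = simulate_scanpath([[5, 1, 0, 1, 4]], n_fixations=4)
```

In this example the simulated gaze visits the strongest peak, moves to the second peak and then to a weaker location while both peaks are inhibited, and returns to the strongest peak once its inhibition has expired.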

Although  the  Itti  and  Koch  (2000)  model  is  explicitly  “bottom  up”  in  nature22,  the  authors  assert 
that with proper selection of weights, the model can be applied “hands off” to a variety of visual
search  situations.  These  weights  include  the  relative  weight  given  to  specific  values  of  features 
(e.g.,  to  a  particular  orientation  or  a  particular  color),  the  relative  strength  of  features  in  the 
calculation  of  the  salience  map,  and  the  temporal  characteristics  of  the  simulated  search  (e.g., 
dwell  time,  frequency  of  saccades,  duration  of  IOR).  During  evaluation  of  the  model  (described 
next),  the  authors  found  a  single  such  set  of  these  characteristics  and  ran  the  model  on  a  variety 
of  scenes  ranging  from  simple  and  conjunctive  visual  search  tasks  to  search  for  military  vehicles 
in the DISTAF image set.23

Overall,  the  authors  report  that  the  model  showed  “reasonable  results”  (Itti  &  Koch,  2000, 
p.  1497)  across  a  variety  of  scenes  ranging  from  simple  search  to  artistic  paintings  to  outdoor 
scenes.  Although  it  is  notoriously  difficult  to  empirically  evaluate  a  set  of  saccades,  the  time  and 
number  of  saccades  required  for  the  model  to  generate  a  fixation  close  enough  to  a  target  to 
acquire  it  may  be  objectively  compared  to  human  search  for  targets  in  the  same  or  similar 
situations. The model was successfully able to produce pop-out effects for simple feature
searches and slower search (with number of saccades increasing as a linear function of number
of distracting elements) for conjunctive search. Thus, for these simplified scenes, an entirely
bottom-up  search  strategy  may  be  sufficient  to  explain  human  behavior. 

The model fared less well when compared to human performance searching for military targets
in the DISTAF image set. After some changes in the temporal dynamics of search to better
match average human characteristics such as saccade frequency and latency (recall that the Toet
et al., 1997, human performance data set did not contain information about eye movements but
only response times to locate the target), the model was able to detect the targets adequately and
in far fewer saccades than would be required if fixations occurred at random locations.

However, both the overall response time required to locate the targets and the pattern of scene
difficulty as determined by human response time rankings were quite different between the
model and the human data. Specifically, although scenes that required more time for humans to
detect the target also required more time for the model to detect the target, the correlation was
extremely weak (it was not mentioned in Itti & Koch, 2000). In addition, there was significantly
more variability in human response time across scenes than there was in model response times,
and in 35 of the 44 scenes evaluated, the model was able to detect the target in many fewer
fixations than humans could.

22  The authors write, “Our model is limited to the bottom-up control of attention, i.e., to the control of selective
attention by the properties of the visual stimulus. It does not incorporate any top-down volitional component” (Itti
& Koch, 2000, p. 1492).

23  Note that when this author refers to “search” for a target, it is not intended to imply that the model actually had
a goal of finding a particular target. Rather, performance was judged on the basis of overall pattern of simulated
saccades which, eventually, fell close enough to the target for it to be acquired.

In order to account for this lack of agreement between human and model performance, Itti and
Koch  (2000)  noted  the  differences  between  the  task  set  for  human  participants  in  the  Toet  et  al. 
(1997)  study  and  those  set  for  the  model.  Specifically,  the  participants  were  trained  in  the 
appearance  (from  three  vantage  points)  of  all  possible  military  targets  before  they  viewed  the 
DISTAF  images.  Itti  and  Koch  (2000)  assert  that,  given  the  difficulty  of  many  of  the  searches24, 
the  goal-directed  knowledge  of  possible  target  identity  possessed  by  human  participants  may 
have  biased  them  toward  poorer  performance  by  continually  drawing  their  attention  to  areas  of 
the  scene  “in  inappropriate  ways”  (Itti  &  Koch,  2000,  p.  1502). 

Parkhurst,  Law,  and  Niebur  (2002)  modified  the  Itti  and  Koch  (2000)  model  to  add  a  more 
realistic  decrease  in  peripheral  contrast  sensitivity.  More  importantly,  their  study  included  the 
collection  of  eye  movements  for  human  observers  viewing  the  same  scenes  to  which  the  model 
was subjected. Similar to Itti and Koch (2000), the tasks in this study did not include
visual search for a target. Rather, participants were told to “look around at the image” for the
5 seconds of each trial (Parkhurst et al., 2002, p. 112). The model was evaluated in terms of how
well  its  predictions  of  locations  of  high  scene  salience  correlated  with  observer  fixation 
locations. 

Results indicated that stimulus-based saliency predicted a significant proportion of the variance in
fixation location, with the strongest correlation occurring early during scene presentation.
That  is,  when  scenes  were  first  presented  to  observers,  the  early  fixations  were  better  predicted 
by  the  model  than  were  later  fixations.  Nevertheless,  the  saliency-based  model  continued  to 
produce  significant  correlations  between  predicted  and  observed  fixations  throughout  the  trial. 
These  findings  are  consistent  with  the  notion  of  gist  extraction  (Intraub,  1981),  as  described 
earlier  during  discussion  of  the  Nicoll  and  Hsu  (1995)  results.  Specifically,  in  search  tasks,  the 
first  few  hundred  milliseconds  of  viewing  a  scene  may  be  consumed  by  the  extraction  of  overall 
spatial  layout  and  schematic  information  from  the  scene  (not  by  the  active  search  for  a  target). 
The  saccades  required  to  extract  this  information,  which  take  place  by  definition  before  there  is 
any  high-level  cognitive  representation  of  scene  content,  are  likely  guided  by  local  scene 
salience. Only later do top-down aspects of gaze selection come into play. Since the Parkhurst
et  al.  (2002)  tasks  did  not  involve  search,  this  initial  stimulus-based  guidance  of  eye  movements 
may  have  been  extended.25 


24  Itti and Koch (2000) omitted the eight most difficult of the 52 DISTAF images because the model or the human
participants were unable to detect the target reliably within a 10-second window.

25  Note that Parkhurst et al. (2002) also found that observers showed a bias toward fixations near the center of the
scene,  particularly  in  early  fixations.  This  finding  may  correspond  to  “orienting”  in  the  scene  before  saccades  that 
support  gist  extraction. 
Contrary to the Parkhurst et al. (2002) findings of significant stimulus-based influences
throughout  all  fixations  in  a  trial,  Turano,  Geruschat,  and  Baker  (2003)  showed  that  the  Itti  and 
Koch  (2000)  model  failed  to  predict  fixation  location  above  chance  levels  in  a  specific  goal- 
directed  task  unless  goal-directed  information  was  inserted  into  the  model.  Participants  in  the 
study  were  asked  to  navigate  an  unfamiliar  hallway  and  to  “walk  through  the  third  door  on  the 
left”  while  wearing  a  head  and  eye  tracker.  Recorded  fixations  were  compared  to  those  predicted 
by  (a)  an  unmodified  Itti  and  Koch  (2000)  model,  (b)  an  Itti  and  Koch  (2000)  model  weighted 
toward  target  features  (vertical  orientation  and  large  spatial  scale),  (c)  an  Itti  and  Koch  (2000) 
model  weighted  toward  target  location  (the  model  was  restricted  to  making  fixations  only  on  the 
left  side  of  fixation),  and  (d)  an  Itti  and  Koch  (2000)  model  weighted  for  target  location  and 
features. 

Analysis  of  model  predictions  and  observed  fixation  locations  was  different  from  that  done  by 
Parkhurst  et  al.  (2002)  in  that  fixations  were  not  assigned  (x,  y)  coordinates  but  were  assigned  to 
regions  of  the  scene,  based  on  contiguous  surfaces  or  objects.  Fixations  were  thus  turned  into  a 
series  of  categories  visited  by  observer  and  model  predictions.  These  sequences  of  categories 
formed  the  data  to  be  correlated. 
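
The region-based coding can be sketched in Python as follows. The sketch assumes rectangular regions and scores agreement as the proportion of fixations, taken in order, for which the observed and predicted regions match. Both choices are simplifications made for illustration: Turano et al. (2003) defined regions by contiguous surfaces or objects, and their exact comparison procedure is not described here. All names and coordinates are hypothetical.

```python
def region_of(point, regions):
    """Return the name of the first rectangular region containing the point."""
    x, y = point
    for name, (x_min, y_min, x_max, y_max) in regions.items():
        if x_min <= x < x_max and y_min <= y < y_max:
            return name
    return "other"


def proportion_matched(observed, predicted, regions):
    """Proportion of fixations whose observed and predicted regions agree."""
    observed_regions = [region_of(p, regions) for p in observed]
    predicted_regions = [region_of(p, regions) for p in predicted]
    n = min(len(observed_regions), len(predicted_regions))
    if n == 0:
        return 0.0
    matches = sum(o == p for o, p in zip(observed_regions, predicted_regions))
    return matches / n


# Hypothetical scene with two regions, given as (x_min, y_min, x_max, y_max).
regions = {"door": (0, 0, 10, 10), "floor": (10, 0, 20, 10)}
observed = [(1, 1), (12, 3), (5, 5), (15, 5)]
predicted = [(2, 2), (3, 3), (6, 6), (30, 30)]

agreement = proportion_matched(observed, predicted, regions)
```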

Results  indicated  that  the  unmodified  Itti  and  Koch  (2000)  model  and  the  model  weighted  for 
target  features  performed  no  better  than  chance  at  predicting  the  regions  of  the  display  fixated. 
The  model  weighted  for  target  location,  however,  performed  better  and  predicted  35%  of  fixation 
regions.  The  model  incorporating  both  location  and  feature  weighting  fared  best,  predicting 
nearly 48% of fixation regions. Together, these results show that (at least for simple
goal-directed behaviors such as walking toward a target) both bottom-up and top-down information
are required for a model to be able to predict human fixation performance.


7.  Other  Topics  of  Interest,  Not  Previously  Addressed 


7.1  Perceptual  Psychology 

In  considering  what  would  make  a  good  model  of  target  acquisition,  one  can  take  the  point  of  view 
that  it  would  be  an  application  of  a  model  of  basic  vision  or  basic  visual  performance  to  a  situation 
in  which  the  observer  seeks  a  target.  The  perceptual  psychology  community  has  long  been 
interested  in  these  basic  models  and  in  the  basic  properties  and  processes  underlying  human  vision. 
It  is  this  author’s  opinion  that  target  acquisition  models  should  attempt  to  incorporate  as  many  of 
these  basic  principles  as  possible  in  order  to  be  flexible  and  robust.  As  such,  this  section  of  the 
report  discusses  aspects  of  vision  and  visual  perception  gleaned  from  the  perceptual  psychology 
literature,  which  have  bearing  on  target  acquisition.  The  section  includes  discussions  of  color 
vision, motion perception, and the effects of visual transients.

7.1.1 Color Perception

Color  perception  is  a  key  aspect  of  human  vision.  In  order  to  account  for  how  humans  perceive 
color,  any  model  must  incorporate  the  following  factors:  luminance,  eccentricity,  and  target- 
background  luminance  contrast.  Color  perception  is  a  function  of  the  cone-type  photoreceptors, 
which are sensitive to light only in the photopic range of luminance. During low-light conditions,
the  cones  do  not  respond  and  all  vision  is  achromatic.  Cones  are  concentrated  at  the  macula  (the 
center  1  degree  of  the  retina)  and  decrease  in  density  quickly  with  eccentricity;  thus,  good  color 
vision  is  afforded  only  for  foveated  targets.  (These  two  factors  interact  in  that  during  low-light 
conditions,  the  poorest  acuity  will  be  at  fixation.)  If  the  target  and  its  background  support  have  the 
same  luminance  and  differ  only  by  color,  the  target  will  not  stand  out  clearly,  and  its  motion  (if  it 
is  moving)  will  be  difficult  to  perceive.  In  addition,  a  considerable  fraction  of  the  male  population 
suffers  from  one  kind  or  another  of  congenital  color  blindness,  indicating  that  consideration  of  an 
impaired  population  may  be  justified  in  considering  a  general  purpose  model. 

Color  is  processed  in  the  human  visual  system  by  three  types  of  photoreceptors,  each  receptive  to 
a  broad  range  of  wavelengths.  These  three  photoreceptors  are  interconnected  in  the  retina  by 
bipolar  and  horizontal  cells  and  innervated  ganglion  cells  representing  combinations  of 
excitatory  and  inhibitory  center-surround  pairs  of  red-green  and  blue-yellow  sensitivity  (Zeki, 
1993).  Substantial  differences  in  ganglion  cell  anatomy  and  physiology  between  color-sensitive 
and  luminance-only  sensitive  neurons  result  in  psychophysical  differences  between  human  color 
vision  and  non-color  vision.  (See  Zeki,  1993,  for  a  very  readable  overview  of  visual 
neurophysiology  in  general  and  color  vision  in  particular.) 

Most models of target acquisition tend not to address color as a driving factor in performance.
(Models of low observable [LO] targets and camouflage, such as CAMELEON26, do, but they
are the exceptions.) This lack of consideration in the modeling literature likely arises from two
basic facts: (a) the enemy would be foolish to send an oddly colored target into a battle since an
object's color is relatively simple to change to fit an environment, and (b) electro-optical sensors
such as I2, synthetic aperture radar (SAR), and FLIR have historically used non-color displays,
so any color in visible light would be lost. However, with the advent of fused sensor systems,
full-color I2 devices (image intensifiers that use more than a single-wavelength phosphor), and
false color FLIR, it would seem that color may become more important in the future of target
acquisition modeling.

When one is considering color in perceptual psychology, there are some issues relevant to target
acquisition modeling efforts: (a) detectability under equiluminance or near-equiluminance,
(b) how color space is to be represented, (c) what levels of target acquisition are aided by the
presence of color information, and (d) how color contrast or salience can be defined. These
topics are interrelated to a certain degree.


26 CAMELEON stands for camouflage assessment by evaluation of local energy, spatial frequency, and
orientation (Hecker, 1992).

Vision for colors displayed at equiluminance (i.e., figures differing from their background by hue
alone)  is  known  to  be  quite  poor.  Theeuwes  (1995)  showed  that  search  for  a  newly  displayed 
stationary  target  whose  color  differs  from  its  background  is  very  slow  unless  the  target  also 
differs  from  its  background  in  luminance.  When  luminance  differences  are  sufficiently  large,  the 
target  will  become  readily  apparent,  even  though  its  color  may  not  be  known  in  advance,  thus 
demonstrating  that  color  differences  can  drive  attentional  selection27.  However,  search  for  a 
target  of  unknown  color,  even  when  its  luminance  is  substantially  different  from  the  background, 
may  be  quite  difficult  if  other  elements  of  the  scene  are  also  uniquely  colored.  That  is  to  say,  if 
the  observer  is  looking  for  a  target  on  the  basis  of  color,  he  would  have  more  difficulty  finding  it 
if  it  is  not  the  only  object  with  a  unique  color  in  the  scene  (Bacon  &  Egeth,  1994;  Theeuwes  & 
Burger,  1998;  Duncan  &  Humphreys,  1989). 

In  certain  circumstances,  color  may  be  able  to  reduce  clutter  effectively.  Clutter,  as  defined  by 
the number or density of confusing non-target objects within a scene, may be reduced if non-targets
are known to be of a different color than the target. Egeth, Virzi, and Garbart (1984)
demonstrated  that  non-targets  of  a  particular  color  do  not  influence  search  response  time  if  they 
are  of  a  color  that  is  sufficiently  different  from  that  of  a  known  target.  Humphreys  and  Muller 
(1993)  incorporated  this  factor  into  their  SERR  model  of  search,  as  discussed  earlier.  This 
conceptualization  of  clutter  has  a  certain  circularity  about  it  since  if  an  object  in  the  scene  is 
confusable  with  the  target,  it  is  clutter;  if  it  is  not,  then  it  is  not  clutter.  If  the  observer  is  aware 
of the color of the target ahead of time, then it may be argued that the differently colored non-targets
do not represent clutter. Clutter metrics that do not take chromaticity into account would
not  be  able  to  incorporate  this  ability  of  the  visual  system. 
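
As a rough illustration of how a known target color could remove differently colored non-targets from a clutter count, consider the sketch below. The function name, the use of two-dimensional chromaticity coordinates, and the distance threshold are assumptions made for this example only; they are not taken from Egeth, Virzi, and Garbart (1984) or from any clutter metric discussed in this report.

```python
import math

def effective_clutter(non_target_colors, target_color, min_distance):
    """Count the non-targets that remain confusable with the target by color.

    Each color is a chromaticity pair, for example CIE (x, y).
    min_distance is a hypothetical threshold: a non-target whose color lies
    at least this far from the target color is treated as not being clutter.
    """
    return sum(
        1
        for color in non_target_colors
        if math.dist(color, target_color) < min_distance
    )

# Hypothetical scene: two non-targets close to the target color, one far away.
non_targets = [(0.31, 0.33), (0.32, 0.34), (0.60, 0.35)]
print(effective_clutter(non_targets, (0.30, 0.33), min_distance=0.05))
```

A metric that ignores chromaticity would count all three non-targets as clutter; the color-aware count drops the one that is far from the target in color space.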

Motion detection is also quite poor during conditions of equiluminance. Cavanagh and Anstis
(1991), Kooi and deValois (1992), Ramachandran and Gregory (1978), and others have
demonstrated that the motion of objects defined only by color is difficult to detect. Kooi and
deValois argue that the neurophysiology of the parvocellular ganglion cells that carry color
signals from the retina to the cortex, as well as the cortical projections themselves, account for
this lack of color-based motion perception. Color information is sent through different parts of
cortical areas V1 and V2 to V4, where largely motion-insensitive color processing occurs.
Non-chromatic motion information, on the other hand, is processed in the middle temporal
(MT)28 area. Near equiluminance, however, motion perception quickly recovers as the luminance difference
between target and background increases. Since most target acquisition situations involve
targets that would be close to equiluminance and similar in hue to their backgrounds (presuming
that appropriate CCD measures are in use), their motion signals may be attenuated compared to
the motion of a more obvious target.

27 The uniquely colored item may have lower luminance contrast than the non-targets. It need only be different in
some way for its color to become important.

28 The underlying logic of this segregation was postulated by Mishkin, Ungerleider, and Macko (1983) as the
separate processing of “what” information (related to form and identity), which includes color, and “where”
information (related to where the object is and where it is going), which does not.

Research  in  experimental  psychology  typically  uses  (x,y)  CIE  (Commission  Internationale  de 
l’Eclairage) color space plus luminance to describe colored stimuli. Other choices are (u’,v’)
color space; red, green, blue; hue, saturation, brightness; cyan-magenta-yellow-black
coordinates; and weights of R/G and B/Y opponent pairs. Models of target acquisition tend to
use  (x,y)  space  or  opponent  pairs  (e.g.,  ORACLE  and  GTV).  Although  photoreceptor  responses 
are  well  characterized,  there  is  some  disagreement  about  the  behavior  of  the  color  opponent  cells. 
The  basis  of  this  disagreement  comes  from  the  fact  that  within  a  population  of,  say,  R+/G- 
center-surround  cells,  there  is  much  variability  in  the  receptive  field  characteristics  and  the 
response  magnitudes  near  and  above  thresholds,  indicating  that  current  physiological 
understanding  may  be  inadequate  to  model  the  system  effectively. 

Recent  research  by  Olds,  Cowan,  and  Jolicoeur  (1999)  indicates  that  mapping  stimuli  into  3-D 
color  space  allows  predictions  to  be  made  about  their  salience  and  distinctiveness.  The  authors 
found  that  targets  were  readily  detectable  in  a  background  of  differently  colored  non-targets  if 
the  coordinates  of  the  colors  of  the  targets  and  non-targets  were  planar  separable29.  Eastman 
(1968)  similarly  used  distance  between  points  in  (u’,  v’,  w’  [luminance])  space  as  a  definition  of 
color  contrast.  Color  contrast  has  also  been  modeled  by  Frome,  Buck,  and  Boynton  (1981)  as  an 
equivalence  term  for  luminance  contrast.  That  is,  the  overall  contrast  of  a  target  is  modeled  as  a 
linear  combination  of  its  achromatic,  R,  G,  and  B  color  dimensions. 
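
The two definitions of color contrast just described can be illustrated with a short sketch. The function names, the example coordinates, and the weights are hypothetical and are used only for illustration; they are not values from Eastman (1968) or Frome, Buck, and Boynton (1981), and the sketch assumes the three axes are scaled so that a plain Euclidean distance is meaningful.

```python
import math

def color_distance(p, q):
    """Euclidean distance between two (u', v', luminance) points.

    Assumption: the axes are already scaled comparably; the source does
    not state how luminance is scaled relative to u' and v'.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def overall_contrast(achromatic, r, g, b, weights):
    """Overall contrast as a linear combination of achromatic, R, G, and B terms.

    weights is a 4-tuple (w_achromatic, w_r, w_g, w_b). The caller must
    supply it; no fitted values are implied here.
    """
    w_a, w_r, w_g, w_b = weights
    return w_a * achromatic + w_r * r + w_g * g + w_b * b

# Hypothetical target and background differing only in luminance.
target = (0.20, 0.48, 0.50)
background = (0.20, 0.48, 0.10)
print(color_distance(target, background))

# Hypothetical weights that let color add to achromatic contrast.
print(overall_contrast(0.10, 0.30, 0.05, 0.00, weights=(1.0, 0.5, 0.5, 0.5)))
```

In the second call, a target with weak achromatic contrast still receives a larger overall contrast because its chromatic terms contribute, which is the sense in which color acts as an equivalent of luminance contrast.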

Thus  far,  color  has  been  mentioned  only  in  its  role  in  detection  of  a  static  or  moving  target. 
Research  from  perceptual  psychology  indicates  that,  inasmuch  as  real-world  objects  are 
concerned,  color  plays  little  role  in  recognition  or  identification30.  That  is  to  say,  the  addition  of 
target  color  information  when  sufficient  information  already  exists  in  the  image  for  us  to 
recognize  the  target  does  not  aid  recognition  performance.  By  far,  the  best  example  of  this  is  the 
work  by  Biederman  and  Ju  (1988),  which  demonstrates  that  in  agreement  with  RBC  theory,  the 
surface  characteristic  of  color  does  not  improve  response  time  to  name  a  common  object.  Even 
an object readily associated with a color, such as a banana, is recognized as quickly in a line
drawing as in a full-color picture.

Biederman  and  Ju’s  finding  is  not  surprising,  given  that  first,  objects  tend  not  to  be  defined  by 
color  alone  and  second,  much  of  what  the  color  vision  system  does  is  provide  color  constancy, 
whereby  a  colored  object  will  appear  to  be  the  same,  regardless  of  the  source  of  illumination 
(Zeki,  1993).  This  so-called  “discounting  of  the  illuminant”  means  that  the  physically  measured 
color spectrum of a surface does not correspond one to one with an observer’s perception of the
color. Implications of this finding may be important for observers who view potential targets in
two very different lighting conditions. Models would have to compensate for the illuminant in
order to accurately predict performance in both conditions.

29 The “points” in color space actually correspond to Gaussians with steep sides; therefore, linear separability of
the peaks does not ensure visual distinctness of the objects.

30 Most research uses line drawings for object recognition. However, more powerful PC-based rendering
software is making the use of solid models more common.

7.1.2  Motion 

Most  models  of  target  acquisition  focus  on  static  images  and  the  static  characteristics  known  to 
affect performance (e.g., clutter, contrast, size, resolution, range, atmospheric interference).
Electro-optical  models  or  front  ends  to  models  (such  as  TARGAC)  may  include  the  temporal 
response  characteristics  of  the  sensor  or  display,  but  the  treatment  of  time  dependence  in  such 
models  typically  relates  to  how  the  sensor  addresses  changes  in  the  scene  over  time  rather  than 
targets  that  may  be  moving. 

The  two  visual  effects  of  a  target  moving  relative  to  its  background  are  a  motion  signal  arising 
from  the  target  itself,  and  the  flicker-like  changes  in  contrast  around  the  borders  of  the  target 
(e.g., if a light target moves over terrain that is alternately dark and light, it may appear to
flicker  with  respect  to  its  background).  The  first  effect  is  well  studied  and  has  been  instantiated 
into  several  models;  the  latter  has  not  been  modeled  successfully. 

The  human  perception  literature  is  useful  when  we  consider  how  motion  can  influence 
performance  in  target  acquisition.  Motion  and  objects  defined  by  motion  (such  as  a  cryptically 
colored  animal  that  suddenly  moves)  are  known  to  be  especially  good  at  directing  visual 
attention  (Hillstrom  &  Yantis,  1994;  Yantis  &  Egeth,  1999;  Wolfe,  1994).  That  is,  a  moving 
stimulus  only  needs  to  be  a  fraction  of  the  physical  intensity  (e.g.,  luminance,  size)  of  a  static 
stimulus  in  order  to  immediately  become  visible.  Some  models  (discussed  next)  take  advantage 
of  this  fact  by  using  motion  to  adjust  the  effective  contrast  of  the  moving  target.  That  said, 
however,  it  is  important  to  note  that  the  effect  of  motion  on  detectability  is  not  constant;  instead, 
it  interacts  with  contrast.  If  the  contrast  of  a  moving  target  is  very  low,  it  will  remain  quite 
difficult  to  see,  regardless  of  its  speed  (Mazz,  Kistner,  &  Pibil,  1998;  Meitzler,  Kistner,  et  al., 
1998). 

Of  particular  importance  to  the  human  visual  system  is  the  appearance  of  a  “looming”  stimulus 
whose  motion  is  toward  the  observer  (Schmidt,  1997;  Yantis  &  Hillstrom,  1994).  The  only  way 
that  current  models  of  target  acquisition  have  incorporated  looming  motion  has  been  to  address 
the  resultant  increase  in  size  and  contrast  of  the  object.  However,  the  size  increase  necessary  for 
looming  objects  to  capture  attention  is  quite  small,  so  a  simple  contrast  increase  might  not  be 
adequate  to  account  for  the  phenomenon.  Therefore,  looming  should  be  treated  as  a  special  case 
of  motion. 

In  addition  to  looming  stimuli,  two  important  aspects  of  motion  perception  are  the  ability  of  the 
visual  system  to  segregate  objects  that  are  not  moving  from  those  that  are  (Watson  & 
Humphreys,  1999),  and  the  detection  of  objects  that  are  moving  in  a  way  inconsistent  with  a 
moving observer viewing a field of stationary objects (Kaiser & Montegut, 1995). That is, if
there are objects at a variety of ranges from the observer, their motion with respect to the
observer  (actually,  about  his  point  of  fixation)  will  be  determined  by  (a)  the  range  from  the 
observer,  (b)  the  speed  of  the  observer,  and  (c)  whether  the  objects  are  stationary  or  moving. 
Kaiser  and  Montegut  determined  that  humans  are  particularly  sensitive  to  objects  whose  motion 
is  inconsistent  with  the  expected  motion  parallax  at  their  position.  In  other  words,  humans  are 
good  at  spotting  moving  objects  when  they  themselves  are  moving. 

This ability comes into play in target acquisition because the observer is not always
stationary and may be moving. Both of these aspects of perception relate to
situations  when  an  object  in  a  scene  is  moving  differently  from  other  objects,  indicating  that  its 
retinal  velocity  is  different  from  what  a  stationary  object  at  that  location  in  space  should  be. 
Therefore,  the  object  is  self-propelled  and  is  likely  of  military  interest.  No  models  thus  far 
encountered  have  taken  relative  motion  signals  into  account  as  they  relate  to  such  implied  depth- 
related  motion  parallax,  although  simple  relative  motion  signals  should  be  able  to  be  modeled 
when  the  frame  of  reference  of  the  scene  is  changed  from  stationary  to  moving. 
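
A minimal sketch of the parallax-consistency idea follows. It assumes a simplified geometry that the report does not specify: the observer translates in a straight line without rotating his gaze, and the range and bearing of each object are known. Under those assumptions, a stationary point at range r and bearing theta (the angle between the direction of travel and the line of sight) has an angular velocity of v sin(theta) / r. The function names and the tolerance value are hypothetical.

```python
import math

def expected_parallax_rate(observer_speed, target_range, bearing_rad):
    """Angular velocity (rad/s) of a stationary point for a translating observer.

    observer_speed and target_range must use consistent units
    (for example, m/s and m).
    """
    return observer_speed * math.sin(bearing_rad) / target_range

def moves_inconsistently(observed_rate, observer_speed, target_range,
                         bearing_rad, tolerance=0.005):
    """Flag an object whose angular velocity departs from the expected parallax.

    tolerance (rad/s) is a placeholder, not an empirically derived threshold.
    """
    expected = expected_parallax_rate(observer_speed, target_range, bearing_rad)
    return abs(observed_rate - expected) > tolerance

# Observer moving at 10 m/s; object 100 m away, directly abeam (90 degrees).
print(expected_parallax_rate(10.0, 100.0, math.pi / 2))
print(moves_inconsistently(0.15, 10.0, 100.0, math.pi / 2))
```

An object flagged by such a check has a retinal velocity different from that of a stationary object at the same location, which is the cue the text associates with a self-propelled object.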

Models  that  account  for  motion  tend  to  relate  it  to  the  probability  of  detection  instead  of 
discrimination  since,  if  anything,  the  structural  detail  of  a  moving  object  will  decrease  because 
of the loss of high spatial frequencies (blur). Electro-optical systems are particularly susceptible to
blur,  depending  on  the  integration  time  of  the  sensor  and  the  sampling  and  display  rates.  The 
effects  of  blur  induced  by  motion  are  considered  in  some  models  (e.g.,  GTV). 

7.1.2.1 Early Models of Motion

An early, empirical, and somewhat cognitive inclusion of motion into search performance was in
Bishop and Stollmack’s (1968) DYNTACS model. DYNTACS incorporated the effect of motion
as an increase in the probability of detection within a time window, Δt. The model parameters are
in terms of range and linear velocity, and the model included a term for “terrain complexity”
which corresponds to possible paths in the scene along which a moving target may travel. As
mentioned  earlier,  the  TARGAC  model  is  based  more  than  most  models  on  atheoretical  data  fits, 
indicating  that  its  results  may  not  be  generalizable  to  other  studies  or  situations. 

Rogers (1972) found that the (luminance) contrast threshold of a moving object remains relatively
constant as the retinal eccentricity increases to around 55 degrees. For a stationary target
to remain barely visible as its eccentricity increases over the same range, a five-fold
increase in contrast is required. (Note that for a small target, the change in receptive field size
with eccentricity cannot account for this finding.) Peterson and Dugas (1972) modified the search
term (P1) in Bailey’s (1970) model to account for motion by increasing the size of the hard shell
lobe as a function of angular velocity:

$$A_g = A_{g0}\,C\,(1 + 0.45\,\omega^2)$$

in which $A_g$ = glimpse aperture adjusted for target motion,

$A_{g0}$ = typical glimpse aperture (hard shell lobe diameter),

$C$ = contrast of target with background, and

$\omega$ = angular velocity of target with respect to the observer.

(Note that this instantiation includes the contrast-motion interaction mentioned earlier: the
adjustment in the glimpse aperture weights contrast very heavily by angular velocity, but when
contrast is near zero, the magnitude of the effect of velocity will be negligible.) Presumably,
such  a  simple  modification  could  be  made  in  a  soft  shell  visual  lobe  calculation,  possibly  by  the 
probability-to-detect  drop-off  becoming  much  shallower  with  eccentricity.  Indeed,  an  increase 
in  soft  shell  lobe  is  exactly  what  Rogers’  result  seemed  to  indicate. 
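
The contrast-motion interaction in the Peterson and Dugas adjustment can be seen numerically in the sketch below. It assumes the adjustment has the form A_g = A_g0 * C * (1 + 0.45 * omega^2); the coefficient 0.45 is this sketch's reading of the printed equation, and the units of angular velocity are not stated in the text, so both should be checked against the original source. The function name is hypothetical.

```python
def glimpse_aperture(a_g0, contrast, omega):
    """Motion-adjusted glimpse aperture (hard shell lobe diameter).

    a_g0:     typical glimpse aperture for a static target
    contrast: contrast of the target with its background
    omega:    angular velocity of the target relative to the observer
              (units assumed to match those used to derive the 0.45 term)
    """
    return a_g0 * contrast * (1.0 + 0.45 * omega ** 2)

# Same angular velocity, two contrast levels (illustrative numbers only).
for contrast in (0.50, 0.01):
    static = glimpse_aperture(2.0, contrast, omega=0.0)
    moving = glimpse_aperture(2.0, contrast, omega=2.0)
    print(contrast, static, moving, moving - static)
```

Because contrast multiplies the whole velocity term, the absolute gain from motion shrinks toward zero as contrast approaches zero, which matches the interaction described above.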

7.1.2.2 More Recent Approaches to Modeling Motion

Meitzler,  Kistner,  et  al.  (1998)  and  Mazz,  Kistner,  and  Pibil  (1998)  investigated  the  effects  of 
motion  on  target  detection  in  controlled  laboratory  experiments.  Findings  from  both  studies 
indicated  that  angular  velocity  was  as  important  a  factor  as,  or  perhaps  even  more  important  a 
factor  than,  range  (which  determines  target  size)  or  contrast  alone  in  the  detection  of  a  target. 
However,  the  effect  of  velocity  was  not  independent  of  other  factors  in  the  study.  Mazz  et  al. 
found  that  velocity  interacted  significantly  with  range  and  with  range  and  contrast.  Meitzler 
et  al.  found  that  velocity  interacted  significantly  with  range  and  with  the  background  used  in  the 
studies  (backgrounds  were  digitized  images  of  different  clutter  levels).  Thus,  it  was  clear  that  an 
isolated  velocity  term  would  be  insufficient  for  a  model  to  account  for  the  effects  of  target 
motion. 

NVESD’s  ACQUIRE  model  (Tomkinson,  1990)  was  modified  by  Meitzler,  Kistner,  et  al.  (1998) 
to  include  a  parameter  for  target  velocity  by  making  the  probability  of  detection  a  function  of 
target  image  size  and  the  target  image  size  necessary  for  50%  ensemble  detection  of  the  target: 

$$P_d = \frac{(A/A_c)^E}{1 + (A/A_c)^E}$$

in which $A$ = target angular extent,

$A_c$ = target angular extent necessary for 50% ensemble detection, and
$E = 2.7 + 0.7(A/A_c)$.

This function is purposefully similar to the TTPF used in other NVESD models. However, the
angular extent necessary for 50% performance ($A_c$) is itself considered a function of target size,
contrast, and angular velocity:

$$A_c = aT_e + bC + cV_a + d$$

in which $T_e$ = target angular extent,
$C$ = target contrast31, and
$V_a$ = target angular velocity.

Although all three terms (and their interactions) are known to affect performance, the conditions
in which one factor may be more important than or may interact with others are not clear.
Therefore, the authors did not attempt to fit the constants. Instead, Meitzler et al. used a fuzzy logic
approach (Zadeh, 1965) to derive fuzzy rules governing the influence of these factors in different
conditions. One half of the data derived from a laboratory study in which target size, contrast,
and angular velocity were controlled was used as input into a fuzzy inference neural network (The
MathWorks, 1995) to derive rules against which the other half of the data was tested. The
authors report that the correlation between derived fuzzy rules and the test data was 0.95.
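
For reference, the detection function described earlier in this section can be sketched as follows. The sketch assumes the TTPF-like form P_d = (A/A_c)^E / (1 + (A/A_c)^E) with E = 2.7 + 0.7(A/A_c), and a linear form for A_c. Because the authors did not fit the constants a, b, c, and d, the caller must supply them; any values passed in are placeholders, and the function names are hypothetical.

```python
def detection_probability(a, a_c):
    """TTPF-like probability of detection.

    a:   target angular extent
    a_c: angular extent needed for 50% ensemble detection (same units as a)
    """
    ratio = a / a_c
    exponent = 2.7 + 0.7 * ratio
    return ratio ** exponent / (1.0 + ratio ** exponent)

def critical_extent(t_e, contrast, v_a, coefficients):
    """Linear form A_c = a*T_e + b*C + c*V_a + d.

    coefficients is a 4-tuple (a, b, c, d). The source reports no fitted
    values, so none are provided as defaults here.
    """
    a, b, c, d = coefficients
    return a * t_e + b * contrast + c * v_a + d

# By construction, a target exactly at the critical extent is detected
# half the time.
print(detection_probability(1.0, 1.0))
```

The explicit linear form is shown only to make the structure concrete; the fuzzy rule base described above replaces it in the authors' approach.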

Another  way  that  motion  has  been  incorporated  into  models  of  target  acquisition  has  been  to 
include  human  visual  physiology,  as  related  to  motion  perception,  into  models  based  on  early 
visual  processes.  How  humans  process  motion  information  has  been  the  subject  of  active 
research  in  the  vision  literature  for  decades.  Studies  that  may  have  some  bearing  on  models  of 
target  acquisition  include  those  focused  on  the  detection  of  motion  signals  among  noise  (e.g., 
Snowden  &  Braddick,  1991;  Verghese  &  Stone,  1995;  Verghese,  Watamaniuk,  McKee,  & 
Grzywacz,  1999),  those  attempting  to  derive  the  basic  motion  features  to  which  the  visual  system 
is  sensitive  (e.g.,  Adelson  &  Bergen,  1985),  and  those  testing  motion  processing  as  related  to 
known  visual  psychophysics  and  physiology  (e.g.,  Grossberg  &  Rudd,  1991). 

7.1.3 Transient Visual Events

Soldiers  in  the  field  routinely  encounter  situations  in  which  events  occur  that  are  only  visible  for 
a  brief  time,  such  as  the  glint  off  a  sight,  a  muzzle  flash,  the  momentary  appearance  of  an  object 
from  behind  an  occluder,  or  an  explosion.  The  presence  of  such  transient  visual  events  can  aid 
or  hinder  search  for  a  target. 

Before  we  discuss  the  specifics  of  how  transients  can  affect  search  and  target  acquisition,  it  is 
necessary  to  understand  how  the  visual  system  responds  to  such  stimuli  during  search.  It  is 
obvious  that  before  an  observer  can  acquire  a  target,  some  representation  of  the  target  must  exist 
in  the  observer’s  visual  system.  At  issue  is  the  amount  of  information  accumulated  over  time  as 
the  observer  views  a  scene.  It  has  long  been  argued  that  the  representation  is  an  “integrative  visual 
buffer”  that  collects  information  and  becomes  progressively  more  detailed  over  time  (Rayner, 
McConkie,  &  Ehrlich,  1978).  There  actually  is  no  such  buffer  and  very  little  information  about 
objects  in  a  scene  remains  when  the  scene  disappears  or  when  we  look  away  (see  Vaughan,  1998, 
however,  for  evidence  that  some  information  does  persist).  This  effect  can  be  seen  in  almost  any 
situation:  close  your  eyes,  turn  around,  and  open  your  eyes  for  1  or  2  seconds.  Then  close  your 
eyes  and  describe  as  much  of  the  scene  as  you  can.  You  will  probably  only  be  able  to  recall  details 
of a handful of objects.

31 The contrast term does not account for the flicker-like effect of rapid changes in target contrast as it moves
across terrain.

The reason why so little information persists is that our mental representation of the scene is
actually  very  sparse,  consisting  of  only  four  or  five  objects  at  a  time  (Rensink,  1996).  The 
mechanism  that  selects  objects  from  the  scene,  binds  their  features  properly,  and  inserts  them 
into  this  representation  is  selective  attention.  Objects  in  the  scene  that  are  clearly  visible,  yet 
unattended,  are  not  perceived  consciously  or  acted  upon  consciously  (Mack  &  Rock,  1998). 
O’Regan  (1992),  Rensink  (1997),  and  Minsky  (1985)  have  argued  that  observers  are  not 
consciously  aware  of  the  sparseness  of  their  mental  representation  because  the  scene  itself  serves 
as  an  external  memory.  In  order  to  acquire  information  about  a  scene,  the  observer  must  focus 
his  attention  on  a  part  of  the  scene,  and  that  part  is  then  encoded  into  the  mental  representation. 

The  role  that  selective  attention  plays  in  conscious  perception  is  the  key  to  understanding  how 
transient  visual  events  affect  target  acquisition.  Search  for  a  target  includes  a  series  of  eye 
movements  to  locations  in  the  scene  similar  to  a  target  along  some  dimension  or  to  locations  as 
determined  by  a  top-down  scan  path.  It  is  in  the  first  case  that  transients  have  their  effect,  since 
attention  is  presumed  to  precede  eye  movements  to  a  location  in  the  scene  that  is  of  interest. 

This  “spotlight”  of  attention  can  readily  be  deployed  to  salient  or  conspicuous  regions  of  the 
scene (Yantis & Egeth, 1999); thus, target conspicuity may determine the probability that the
target will be attended and fixated. Transient visual events have the ability in certain circumstances
to disrupt this salience-based attentional deployment system (e.g., Yantis, 1996;
O’Regan,  Rensink,  &  Clark,  1999)  and  involuntarily  summon  or  “capture”  attention  to  their 
locations. 

If  the  transient  event  occurs  at  the  location  of  the  target  (such  as  a  glint  or  muzzle  flash),  then 
such  a  transient  will  increase  the  probability  that  the  target  will  be  fixated.  In  addition  to  a 
sudden  increase  in  luminance  or  contrast,  the  appearance  of  a  new  perceptual  object  (e.g.,  when 
an  object  suddenly  becomes  visible  as  it  appears  from  behind  an  occluder)  is  also  known  to 
capture  attention  (Hillstrom  &  Yantis,  1994;  Yantis  &  Hillstrom,  1994).  Attention  may  be 
captured  even  if  the  contrast  of  the  new  target  is  not  sufficiently  high  to  be  judged  as  salient  or 
conspicuous  if  it  had  not  just  appeared.  An  interesting  aspect  of  attentional  capture  is  that  it  can 
occur  even  if  the  scene  is  highly  congested.  Therefore,  any  model  that  incorporates  clutter  into 
dynamic  scenes  must  treat  visual  transients  as  a  special  case  in  which  the  effects  of  clutter  are 
strongly  attenuated. 

Attention is not always captured by transient events, however. As is the case with moving
objects, transients will capture attention only when the increase in luminance or contrast, or
the contrast of the new object, is sufficiently high32. Enns and Austen (1999) found that low
contrast targets failed to capture attention in such circumstances, but moderate contrast targets
did so quite effectively. Valeton and Bijl (1995), in evaluating the TARGAC model on data
from the Battlefield Emissive Sources Trials under the European Theater Weather and
Obscurants (BEST TWO; NATO, 1990) studies, found that targets that appeared suddenly were
particularly difficult to see. The targets tended to be small and of low contrast. The reason why
observers in the BEST TWO study found these targets particularly difficult is that in the low
contrast conditions, they were no more salient than other targets and were available in the scene
for less time.

32 Note that the increase in luminance or contrast associated with the transience will render it far more likely to
capture attention than a static object with the same high luminance or contrast (Yantis, 1996).

In  addition  to  transient  events  failing  to  alert  the  observer  to  a  potential  target,  they  may  also 
hinder  search.  If  a  transient  event  occurs  at  a  non-target  location  or  at  an  already  acquired  target, 
attention  may  be  captured,  thereby  disrupting  a  salience-  or  conspicuity-driven  search  of  the 
scene.  O’Regan,  Rensink,  and  Clark  (1999)  demonstrated  that  when  a  “mud  splash”  (a  convex 
gray  region)  was  repeatedly  added  to  and  taken  away  from  a  scene,  the  time  required  to  search 
for  a  target  increased  dramatically33.  Even  when  the  target  itself  represented  a  transient  event 
(such  as  a  sudden  movement  or  color  change  or  a  sudden  appearance  of  a  new  object),  the  more 
salient  mud  splash  disrupted  search. 

In  target  acquisition  situations,  the  effect  of  irrelevant  transients  is  likely  to  be  manifested  by  a 
change in search strategy from an efficient one to an inefficient one. Target conspicuity has been
the subject of much study because it is a good predictor of search performance. The reason why
conspicuity  drives  search  performance  is  that  the  attention  system  is  able  to  quickly  select 
conspicuous  regions  in  the  scene  during  search.  Other,  less  conspicuous  regions  are  not  searched 
because  targets  are  deemed  less  likely  to  be  in  them.  When  search  is  difficult,  a  more  systematic, 
top-down  search  strategy  is  employed  that  involves  a  conscious  pattern  of  searching  the  scene, 
often  including  parts  of  the  scene  where  no  target  is  likely  to  be.  In  the  presence  of  transients,  an 
analogous  strategy  is  employed.  O’Regan  et  al.  found  that  repeated  mud  splashes  forced  subjects 
to  abandon  a  conspicuity-driven  search  and  adopt  a  slower,  systematic  search.  Search  models 
that  include  transients  may  benefit  from  the  addition  of  a  systematic  search  that  occurs  when 
such  repeated  transients  occur. 
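
As a minimal sketch of how such a fallback might be added to a search model, the following Python fragment chooses a fixation order from conspicuity scores unless the number of recent irrelevant transients reaches a limit, in which case it reverts to a fixed systematic scan. The function name, the per-region scores, and the `transient_limit` threshold are illustrative assumptions; they are not taken from O’Regan et al. or from any fielded model.

```python
from typing import List, Sequence


def fixation_order(conspicuity: Sequence[float],
                   transient_count: int,
                   transient_limit: int = 3) -> List[int]:
    """Return the order in which scene regions are visited.

    conspicuity[i] is a hypothetical conspicuity score for region i.
    transient_limit is an assumed threshold, not a value from the report.
    """
    regions = range(len(conspicuity))
    if transient_count >= transient_limit:
        # Repeated irrelevant transients: fall back to a systematic scan
        # that visits every region in a fixed spatial order.
        return list(regions)
    # Otherwise visit the most conspicuous regions first.
    return sorted(regions, key=lambda i: conspicuity[i], reverse=True)
```

With no transients, the order follows conspicuity; once the assumed limit is reached, every region is visited in turn, including regions unlikely to contain a target.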

7.2  Multiple  Targets 

War game simulations, like real-world combat situations, are not exclusively single-target
scenarios.  Often,  a  Soldier  is  confronted  by  several  targets,  all  of  which  may  be  obscured, 
camouflaged,  or  otherwise  difficult  to  acquire.  The  issues  of  how  to  model  target  acquisition  in 
such  an  environment  are  complicated  because  additional  assumptions  must  be  made  regarding 
what  the  observer’s  task  is,  how  search  progresses,  and  how  limiting  conditions  arise.  In 
addition,  models  must  be  altered  differently,  depending  on  whether  they  predict  individual  or 
ensemble  performance. 

33This research was funded by Nissan Motor Corporation. The researchers were interested in the effects of
material splashed onto car windshields on a driver’s ability to spot important changes in the scene, such as a person
stepping into the roadway.

Looking first at the task, a model may predict the probability of first acquisition (i.e., the
probability that any target will be acquired) or the probability that multiple targets are acquired.
The simplest solution to the problem of multiple targets would be to work within the framework
of individual performance prediction. In such a framework, the only additional assumption
needed would be a specific statement of the search-quitting criterion. For example, first-detection
performance could be modeled with no changes in a single-target model except that the
probability of a target within a glimpse would increase. (Of course, depending on the model, the
presence of multiple targets may also affect factors such as fixation selection, decision criterion,
etc.) Predicting multiple-acquisition performance requires either that the simulated observer know
how many targets there are and stop after they are all acquired or that a time limit be placed on the
search process, which then continues until the limit is reached. In either case, the model must keep track of
targets that have already been acquired so that a single target will only be acquired once. (This
addition of a memory component to search is built into some models, such as GTV, but is
lacking from others, such as Nicoll & Hsu’s [1995] model.)
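
The bookkeeping described in this paragraph can be illustrated with a short Monte Carlo sketch. It assumes a fixed per-glimpse acquisition probability for each target (a hypothetical simplification, as are all names below) and shows the two ingredients named in the text: a quitting criterion and a memory of targets already acquired.

```python
import random
from typing import List, Set


def simulate_search(p_glimpse: List[float], max_glimpses: int,
                    stop_after_first: bool, seed: int = 0) -> Set[int]:
    """Simulate one observer and return the indices of acquired targets.

    p_glimpse[i] is an assumed per-glimpse acquisition probability for
    target i; max_glimpses plays the role of a time limit on search.
    """
    rng = random.Random(seed)
    acquired: Set[int] = set()            # memory of acquired targets
    for _ in range(max_glimpses):
        for i, p in enumerate(p_glimpse):
            if i in acquired:
                continue                  # a target is acquired only once
            if rng.random() < p:
                acquired.add(i)
                if stop_after_first:
                    return acquired       # first-acquisition criterion
        if len(acquired) == len(p_glimpse):
            break                         # every known target was found
    return acquired
```

Averaging the outcome over many seeds would give an estimate of first-acquisition or multiple-acquisition probability under these assumptions.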

Search  models  from  perceptual  psychology  rarely  use  multiple  targets  except  as  a  test  of  the 
serial  or  parallel  nature  of  a  purported  search  process  by  examining  a  phenomenon  called 
redundancy  gain  (e.g.,  Egeth  &  Mordkoff,  1991)34.  The  lack  of  interest  in  multiple  target  search 
may  also  stem  from  the  fact  that  these  models  are  all  based  on  individual  performance  and,  as 
mentioned  before,  are  easily  extendible  to  multiple  target  situations. 

Predicting  individual  performance  in  a  static  detection  (i.e.,  non-search)  task  requires  the  targets 
in  question  to  be  within  the  observer’s  search  lobe.  That  condition  being  met,  assumptions  must 
be made regarding limits to an observer’s performance in the task. Decisions made on the basis
of signal detection theory (e.g., based on SNR or SCR) may proceed in one of two ways in multi-target
scenarios. First, the task may be redefined as several independent decisions (one for each
target) with a logical OR determining the probability of first acquisition. Second, the signal and
noise  terms  must  be  redefined  to  take  into  account  contributions  from  all  the  targets;  then  a 
single  global  decision  must  be  made  to  judge  if  the  signal  arose  from  a  target  (or  targets)  or 
noise. 
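
For the first approach, the logical OR over independent decisions has a simple closed form. The sketch below assumes the per-target decisions are statistically independent, as the text states; the function name is hypothetical.

```python
from math import prod
from typing import Iterable


def p_first_acquisition(p_targets: Iterable[float]) -> float:
    """Probability that at least one of several independent targets is acquired.

    The OR of independent events is one minus the probability that
    every individual decision is a miss.
    """
    return 1.0 - prod(1.0 - p for p in p_targets)
```

For example, two targets that would each be acquired with probability 0.5 give a first-acquisition probability of 0.75 under this assumption.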

Predicting ensemble performance in a static search is considerably more daunting. The difficulty
arises from how asymptotic performance terms such as P∞ are conceptualized. That is, depending
on what it actually means that a particular target in a particular scene will be acquired by P∞ of an
ensemble of observers, the predictions for how P∞ changes with the number of targets will be
different. Six possible meanings of P∞ are discussed.

34The issue of whether visual search progresses in a serial or parallel manner has long been contentious in
perceptual psychology (e.g., Palmer & McLean, 1995; Townsend, 1971, 1990) since the processes underlying
parallel and serial search differ dramatically. Redundancy gain is but one technique for obtaining data that may be
able to tease apart the serial/parallel distinction. In terms of target acquisition modeling, the difference is not as
important because the very fact that serial and parallel processes can mimic each other in RT or accuracy measures
indicates that neither type of model is likely “better” at predicting relevant performance.

Rotman, Gordon, and Kowalczyk (1989) considered three possible reasons why ensemble
detection performance is imperfect, given infinite time.

1. The ensemble of observers is strictly ordered in terms of target acquisition competence.
That is, some of them are simply better at detecting targets than others. These observers
will be consistently better, regardless of the target or background (i.e., they will require
fewer cycles on target to acquire the target). Silk (1997) refers to this explanation of P∞ as
the “observer-only” account.

2. Observers are equivalent in a statistical sense, but the responses to any given target will be
stochastic (within the bounds that an ensemble must perform at a level of P∞). Some
observers will confuse the target with background clutter and will not evaluate it any
further while other observers will not make this confusion. In this case, observers who
cannot detect a target in one situation may be able to do so in another. Silk (1997) refers
to this explanation of P∞ as the “observer-target” account35.

3.  Observer  performance  will  decrease  over  time  because  of  mental  weariness.  Some 
observers  are  able  to  acquire  the  target  within  a  critical  period  and  some  are  not. 

Rotman  et  al.  (1989)  derived  predictions  on  the  basis  of  these  three  assumptions  and  compared 
them  to  data  based  on  images  from  Hughes  Aerospace  (Scanlan  &  Agin,  1978).  The  proportion 
of  the  population  of  observers  who  were  able  to  detect  targets  of  varying  degrees  of  difficulty 
strongly favored either explanation 2 or 3 over explanation 1. The authors point out, however,
that the small number of observers in the study reduced the statistical power of their tests to the point
that no explanation could be eliminated definitively.
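
To see why the meaning of P∞ matters in multi-target scenes, consider the asymptotic probability that an observer acquires at least one of several targets. The sketch below is one possible reading of explanations 1 and 2, not a formula from Rotman et al.: under a strict ordering of observers, anyone who can acquire a harder target can also acquire an easier one, so the easiest target sets the limit; under the stochastic account, outcomes for different targets are assumed here to be independent. Function names are hypothetical.

```python
from math import prod
from typing import Sequence


def p_inf_any_observer_only(p_inf: Sequence[float]) -> float:
    """Explanation 1 (strictly ordered observers).

    Observers who acquire a harder target form a subset of those who
    acquire an easier one, so the union equals the largest single value.
    """
    return max(p_inf)


def p_inf_any_observer_target(p_inf: Sequence[float]) -> float:
    """Explanation 2 (stochastic responses), assuming independence
    between targets, which is an added assumption of this sketch."""
    return 1.0 - prod(1.0 - p for p in p_inf)
```

For two targets with single-target values of 0.6 and 0.4, the first reading yields 0.6 and the second yields 0.76, so the two accounts diverge as targets are added.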

In  addition  to  the  three  explanations  mentioned,  common  sense  tells  us  that  a  combination  of 
these  factors  is  probably  occurring:  Some  observers  are  better  than  others,  and  some  targets  (for 
reasons  unknown)  will  be  more  difficult  than  others,  regardless  of  how  facile  a  target  acquirer 
any  given  individual  is.  Whether  mental  weariness  comes  into  play  is  unclear.  Likely,  in  the 
case  of  testing  the  explanations  with  empirical  data,  weariness  would  not  be  a  factor,  given  the 
controlled  situations  in  which  the  data  were  collected.  Combinations  of  these  explanations  have 
been termed “hybrid” models by Silk (1997).

Silk (1997) analyzed a data set from O’Kane, Walters, and D’Agostino (1993) to determine
whether a “hybrid” explanation of P∞ could be based on the deterministic observer-only and
stochastic observer-target processes. He found that those two factors, plus a degree of
uncertainty that results from uncertainty in target signature computation36, completely
defined  observer  performance.  In  other  words,  given  the  inherent  uncertainty  in  determining 
target  characteristics,  observer  performance  can  be  described  as  a  combination  of  observer-only 
and  observer-target  explanations. 

35The Army combat model JANUS (not an acronym) assumes that P∞ is purely observer-target based.

36Silk (1995) demonstrated that the modeling uncertainty in Johnson-like models is statistically unbiased. That
is, the uncertainty in predictions of target detectability is independent of the actual detectability.

In addition to Rotman et al.’s (1994a) explanations, Nicoll (1994) put forth three additional
possibilities for P∞ that have bearing within a neoclassical search framework.

4. There is another state in the search process called “quit,” and the corresponding rate, Q, at
which this state is entered from any other state. The probability of quitting as a function of
time is then very much like the probability of fixating a target as a function of time, except
that it is the linear combination of three rather than two exponentials. The number of
targets detected before quitting (rather, the distribution of such trials) determines P∞.

5.  The  number  of  visits  to  a  target  may  be  restricted  (possibly  because  of  a  temporal  cut-off  or 
a  moving  FOR). 

6.  Assume  that  the  Markov  process  is  not  memoryless  but  that  the  amount  of  information 
accumulated  during  visits  to  the  target  decreases  over  time.  If  the  asymptotic  amount  of 
information  obtainable  about  the  target  is  below  that  required  to  detect  it,  then  detection 
cannot  occur. 

Nicoll  (1994)  has  not  offered  any  data  to  support  any  explanation  over  the  others  but  presented 
them  as  examples  of  the  flexibility  of  the  neoclassical  framework. 
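
The way a quit state can produce an asymptote below 1 may be seen in a deliberately reduced sketch. It collapses the search chain to two competing exponential rates, a detection rate and the quit rate Q, which is a simplifying assumption of this example and not Nicoll's full formulation (which involves additional states and hence more exponential terms). Function and parameter names are hypothetical.

```python
from math import exp


def p_detect_by(t: float, detect_rate: float, quit_rate: float) -> float:
    """Probability of detection by time t when detecting and quitting
    compete as independent exponential processes (assumed model)."""
    total = detect_rate + quit_rate
    if total <= 0.0:
        return 0.0
    return (detect_rate / total) * (1.0 - exp(-total * t))


def p_infinity(detect_rate: float, quit_rate: float) -> float:
    """Asymptotic detection probability: the chance that detection
    occurs before the observer quits."""
    total = detect_rate + quit_rate
    return detect_rate / total if total > 0.0 else 0.0
```

In this reduced form, a nonzero quit rate caps asymptotic performance at detect_rate / (detect_rate + quit_rate), however long the search is allowed to run.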

Other issues in the multi-target scenario concern the expectations of the observer and the
differences between the targets. For example, if there are two very different targets in the scene,
the observer must know that there are two (according to many models that base their predictions
on a known target representation) or must base his search on a general metric or search strategy
that makes no assumptions about the appearance of the targets. Also, if the subject expects, or
has been trained in, a target-rich scenario, his performance in a low-contrast multi-target
scenario will differ from that of someone trained in a different scenario (e.g., Doll & Schmieder,
1993). Specifically, the former observer will be more likely to hazard many false alarms
whereas the latter will be more conservative. More is said about dependent measures other than
Pd in a later section.

Classic  studies  in  perceptual  psychology  have  shown  that  if  the  various  targets  are  similar  to 
each  other  in  appearance  and  are  different  from  non-targets  in  appearance,  then  little  training  will 
be required for search performance in the multi-target situation to be as good as in the single-target
situation (Schneider, Dumais, & Shiffrin, 1984). However, as targets become different
from  each  other  and  more  similar  to  non-targets,  training  will  take  much  longer  to  achieve  the 
same  level  of  performance  (Schneider  et  al.,  1984;  Duncan  &  Humphreys,  1989). 

7.3  Blur,  Noise,  and  Obscurants 

Different  factors  limit  human  target  acquisition  performance  in  threshold  and  super-threshold 
situations.  At  or  near  threshold,  human  performance  is  noise  limited;  above  threshold,  human 
performance  is  contrast  limited  (Lloyd  &  Sendall,  1970).  The  role  of  noise  in  target  acquisition 
is  not  limited  to  the  threshold  of  our  sensory  system,  however.  The  same  limits  to  detection  of 
visible  form  apply  when  noise  is  relative  to  signal  strength.  That  is,  when  noise  is  high,  human 
perception is noise limited; when noise is low, human performance is contrast limited.

Noise in target acquisition comes from a variety of sources: the absolute threshold of vision for
a  dark-adapted  observer  is  determined  partly  by  quantum  noise  (probabilistic  absorption  of 
photons  by  photochemical  molecules)  and  neural  noise  (photochemical  breakdowns  and  firing  of 
neurons  in  the  visual  system).  For  the  military  observer,  visual  noise  of  interest  typically  comes 
from  the  display  or  the  sensor  on  and  through  which  he  is  viewing  an  image  of  the  scene. 

Noise  varies,  depending  on  the  type  of  sensor.  FLIR  sensors  are  susceptible  to  noise  from  IR 
atmospheric  emissions  and  scatter,  thermal  noise  within  the  sensor,  and  scintillation  noise 
because  of  turbulence  of  the  air  along  the  line  of  sight  of  the  sensor.  The  latter  noise  can  take  the 
form  of  blur  (the  loss  of  high  spatial  frequency  information)  if  the  integration  time  of  the  sensor 
is  long  or  motion  artifacts  (small  moving  images  that  do  not  correspond  to  objects  moving  in  the 
field) if the integration time of the sensor is brief and its spatial resolution is high. Image
intensifiers do not suffer from so many sources of noise because the wavelengths of light
intensified  by  the  sensor  do  not  interact  so  readily  with  particulate  matter  in  the  air  column37. 

The  effects  of  atmospheric  noise  in  FLIR  sensors  are  well  understood  and  modeled  quite 
effectively  (e.g.,  the  TARGAC  front  end  for  NVESD  static  detection  models).  However,  the 
effect  of  noise  on  human  decision  making  is  not  as  clear. 

Blur,  the  loss  of  fine  spatial  detail  (i.e.,  an  attenuation  of  high  spatial  frequency  information),  is 
well  understood,  in  theory  at  least.  An  across-the-board  degradation  in  performance  is  expected 
for  all  levels  of  target  acquisition  because  the  loss  of  detail  is  akin  to  a  reduction  of  contrast  of 
targets  to  the  point  that  the  modulation  of  their  fine  details  falls  below  threshold.  Blur  can  be 
instantiated  in  a  model  with  a  digital  blur  operation  on  an  input  image  (such  as  Gaussian  blur)  or 
by modulation of the Fourier components of an image with a high-frequency-attenuated
modulation transfer function. The resulting decrease in effective contrast can be traced along a
TTPF to determine the concomitant loss in performance.
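
As a hedged illustration of the Fourier-domain route, the sketch below uses the standard result that a Gaussian blur of spatial standard deviation sigma has the modulation transfer function exp(-2π²σ²f²). The choice of a Gaussian, the units (for example, mrad and cycles/mrad), and the function names are assumptions of this example and not a description of any particular sensor model.

```python
from math import exp, pi


def gaussian_mtf(freq: float, sigma: float) -> float:
    """MTF of a Gaussian blur: attenuation of a sinusoid of spatial
    frequency freq by a Gaussian kernel of standard deviation sigma.
    freq and sigma must use reciprocal units (assumed by the caller)."""
    return exp(-2.0 * (pi * sigma * freq) ** 2)


def blurred_contrast(contrast: float, freq: float, sigma: float) -> float:
    """Effective contrast of a target detail at the given frequency
    after blur; higher frequencies are attenuated more strongly."""
    return contrast * gaussian_mtf(freq, sigma)
```

The attenuated contrast at the spatial frequency of interest could then be compared with a threshold, or carried through a TTPF, to estimate the loss in performance.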

Aleva and Kuperman (1997) evaluated the effects of various kinds of scene degradation on the
detection and recognition of a variety of Army vehicles at various ranges using a signal detection
paradigm38. The authors manipulated scenes by increasing scene modulation (reduction in
contrast), blur, and white noise. The authors noted two effects of significance. First, modulation
and blur interacted (as one would expect). Second, the effect of blur and modulation was
manifested as a decrease in hit rate only; the false alarm rate remained constant. From these results,
it was concluded that the sensitivity of the observer was decreasing as a result of the image
degradations. Such a result is consistent with the loss of information or, in signal detection terms,
the decrease in SNR in conditions of blur and modulation.


37Stereoscopic  image  intensifiers,  with  an  intensifier  tube  for  each  eye,  are  even  less  susceptible  to  noise.  The 
scintillation  noise  in  the  tubes  is  uncorrelated,  and  the  visual  system  has  little  trouble  discounting  it  from  the 
otherwise  stereoscopic  image  of  the  scene.  The  per-item  cost  of  such  systems  remains  prohibitive,  however. 

38The manuscript by Aleva and Kuperman serves as an excellent review of basic visual psychophysics and of
signal detection theory.

Obscurants make target acquisition difficult by blocking the electromagnetic radiation reflected
or  emitted  by  the  target  so  that  it  is  never  detected  by  a  sensor.  Unlike  noise  or  blur,  however, 
obscurants  have  a  temporal  character  since  the  consistency,  density,  and  amount  of  obscurant 
between the sensor and the target are not uniform over time. Rotman, Gordon, and Kowalczyk
(1991) extended the NVESD static detection model to account for time-varying obscurant smoke
by assuming that as the smoke obscures more target information, the proportion of observers
who will be able to detect the target will decrease. The authors modeled the performance for an
ensemble by estimating a mix of performance for an unobscured target and a steady-state
obscured target. The main prediction of the model is that time-varying obscurant performance
will reach an asymptote at the level for an unobscured target. The model has been applied to
engineering specifications of several fielded FLIR sensor systems, but it has not yet been validated
by human data from field tests.

7.4  Measures  of  Performance  Other  Than  Pd 

The  most  influential  modeling  concept  in  the  past  40  years  has  been  the  Johnson  criteria  and  the 
corresponding  TTPF.  The  resulting  static  target  discrimination  model  incorporated  into  several 
NVESD  models  is  considered  one  of  the  most  common  (and  most  effective)  models  for 
ensemble  performance.  However,  the  model  only  makes  predictions  of  a  single  variable  in  a 
single  type  of  situation:  Pd,  the  probability  of  detecting  a  target  when  one  is  present.  Other 
performance  measures  can  be  inferred  if  one  makes  assumptions  about  how  a  decision  is  made 
(e.g., Rotman et al., 1991), but the Johnson criteria are by themselves limited in how they inform
us  about  the  process  of  target  acquisition. 

Over  the  years,  different  tasks  and  different  analyses  have  led  to  several  ways  of  characterizing 
observer  performance  in  target  acquisition  tasks.  This  section  discusses  a  number  of  them: 
Schmieder and Weathersby’s (1983) Pacq measure, the false detection percentage (FDP),
response time, the determinants of performance according to signal detection theory (Phit, Pfa
(FAR), d', A', and β), and real-time eye movement data.

Although  popular  models  such  as  ACQUIRE  predict  search  performance  over  time,  they  do  so 
by  predicting  Pd  as  a  function  of  time  only.  Such  a  measure  is  useful,  especially  for  war  game 
simulation  in  which  it  is  important  to  predict  the  detectability  of  a  target  when  only  a  certain 
amount  of  time  is  available  to  scrutinize  the  scene.  However,  this  measure  in  and  of  itself  is 
limited  in  how  well  it  predicts  overall  observer  behavior.  The  primary  shortcoming  of  the 
measure  is  that  it  does  not  address  observer  false  alarms  (i.e.,  reporting  a  target  when  none  was 
present).  False  alarms,  also  called  false  detections  in  some  analyses,  can  be  further  subdivided 
into  cases  when  no  target  was  present  and  cases  when  a  target  was  present  but  the  observer 
mistakenly  reported  that  a  non-target  element  was  the  target.  Such  a  distinction  can  be  made 
when  an  observer  is  forced  to  localize  a  target  in  a  scene  in  addition  to  reporting  merely  its 
presence,  or  it  can  be  inferred  from  observer  response  and  eye  movement  data.  (The  potential 
value of eye movement information is discussed shortly.)

In their analysis of observer performance in cluttered environments, Schmieder and Weathersby
(1983) determined that Pd might not always be a meaningful measure since a high rate of false
alarms is typically observed in conditions of high clutter. The authors proposed instead the
measure Pacq, the probability of acquisition, defined as the probability that an observer can
correctly acquire a target after n+1 investigations in which n false targets were first correctly
rejected:

$$P_{acq} = P_d \, [1 - P(FA)]^{n}$$

in which P(FA) = fixed probability of false alarm (based on clutter level),
Pd = probability of detection, and
n = the number of objects investigated when a target is located.

This measure of performance assumes that clutter attracts eye movements in a discrete manner.
It also presumes that clutter's effect is on the probability of false alarms and, to a lesser extent, on
the probability of detection, as determined by the SCR.
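
Reading the verbal definition of Pacq as n correct rejections, each with probability 1 - P(FA), followed by one detection with probability Pd, the measure can be computed as in the following sketch. The function name and the example values are hypothetical; the independence of successive investigations is implied by the definition and is assumed here.

```python
def p_acq(p_d: float, p_fa: float, n: int) -> float:
    """Probability of acquiring the target on investigation n + 1 after
    correctly rejecting n false targets (assumed independent)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return p_d * (1.0 - p_fa) ** n
```

With Pd = 0.8 and P(FA) = 0.1, two intervening clutter objects reduce the acquisition probability from 0.8 to about 0.65, which illustrates how clutter lowers Pacq even when Pd itself is unchanged.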

A  further  problem  with  using  response  time  and  a  single  accuracy  measure  (e.g.,  Pd)  is  that  it 
ignores  how  an  observer  makes  a  decision.  For  example,  a  speed-accuracy  trade-off  may  occur. 
Speed-accuracy  trade-offs  result  when  an  observer  with  a  lax  criterion  for  deciding  that  a  target 
is  present  responds  faster  and  makes  more  errors  than  an  observer  with  a  more  stringent  criterion, 
who  responds  more  slowly  and  makes  fewer  errors.  This  pattern  of  errors  and  response  times 
may occur even if the observers are equally good at detecting the target. The
decision criterion varies not only between observers (see, e.g., Rotman, Gordon, & Kowalczyk,
1989, for an analysis of performance based on this assumption) but also within observers as a
function  of  training,  stress,  fatigue,  expectation,  the  costs  and  benefits  (“payoffs”)  of  rendering  a 
decision,  and  concurrent  task  load.  As  such,  it  is  impossible  to  determine  how  sensitive  an 
observer  is  to  the  presence  of  a  target  by  looking  solely  at  RT  and  Pd. 

The  method  used  most  often  to  separate  the  contributions  of  observer  sensitivity  and  criterion  in 
making  a  decision  is  called  Signal  Detection  Theory  (SDT)  (Green  &  Swets,  1966).  Briefly, 
signal  detection  theory  asserts  that  the  detection  of  a  signal  requires  an  observer  to  be  able  to 
distinguish  between  noise  inherent  in  the  sensory  system  and  a  signal  added  to  that  noise.  Signal 
and  noise  distributions  are  assumed  to  be  normal  and  have  equal  variance.  An  observer  bases  his 
decisions  on  sensitivity  (his  visual  system’s  ability  to  distinguish  between  the  noise  and  the 
signal-plus-noise distributions) and the criterion that he sets for determining if a given sensory
signal arose from the signal or noise distribution. A sensory signal whose strength is above the
criterion will be reported as a signal; one whose strength falls below the criterion will be reported
as the absence of a signal. See MacMillan and Creelman (1991) for an excellent introduction to
SDT.

The assumption that signal and noise distributions are normally distributed (with the same
standard deviation) allows the decision criterion, β, to be separated from the observer sensitivity,
d'39. To perform an SDT analysis, both the hit rate, defined as P(target reported | target
present) (which is the same as Pd for a single subject), and the false alarm rate (FAR), defined as
P(target reported | target absent), are needed. Further assumptions may be needed to perform
SDT analysis on data in which false alarms include non-targets misidentified as targets when
targets were present elsewhere.
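
A minimal sketch of such an analysis, using the textbook equal-variance Gaussian formulas, is given below. It computes d', the criterion location c, and the likelihood-ratio criterion β from a hit rate and a false alarm rate, along with the non-parametric A' of Pollack and Norman (1964) in its common computing form. Function names are hypothetical, and rates of exactly 0 or 1 are not handled (they would require a correction that is outside the scope of this sketch).

```python
from math import exp
from statistics import NormalDist

_z = NormalDist().inv_cdf   # inverse of the standard normal CDF


def d_prime(hit_rate: float, fa_rate: float) -> float:
    """Sensitivity: separation of the two distributions in SD units."""
    return _z(hit_rate) - _z(fa_rate)


def criterion_c(hit_rate: float, fa_rate: float) -> float:
    """Criterion location relative to the midpoint of the distributions."""
    return -0.5 * (_z(hit_rate) + _z(fa_rate))


def beta(hit_rate: float, fa_rate: float) -> float:
    """Likelihood ratio at the criterion; ln(beta) = c * d'."""
    return exp(criterion_c(hit_rate, fa_rate) * d_prime(hit_rate, fa_rate))


def a_prime(hit_rate: float, fa_rate: float) -> float:
    """Non-parametric sensitivity; 0.5 is chance, 1.0 is perfect."""
    h, f = hit_rate, fa_rate
    if h == f:
        return 0.5
    if h > f:
        return 0.5 + (h - f) * (1.0 + h - f) / (4.0 * h * (1.0 - f))
    return 0.5 - (f - h) * (1.0 + f - h) / (4.0 * f * (1.0 - h))
```

A hit rate of 0.8 with a false alarm rate of 0.2 gives A' = 0.875 and an unbiased criterion (c = 0, β = 1) under these formulas.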

SDT  has  been  used  extensively  to  examine  target  acquisition.  (For  a  more  thorough  review  of 
SDT  analysis  as  it  applies  specifically  to  target  acquisition,  see  Wilson,  1992.)  Of  particular 
interest  is  how  false  alarms  are  affected  by  various  factors.  As  mentioned  earlier,  Aleva  and 
Kuperman (1997) used SDT to evaluate the effects of modulation, blur, and noise on target
acquisition  performance.  Their  results  showed  a  decrease  in  hit  rate  but  no  change  in  FAR  as 
scene  quality  decreased,  indicating  that  subjects  in  the  study  did  not  shift  their  criteria  but  were 
becoming  less  sensitive  to  the  targets. 

Doll and Schmieder (1993) conducted the first study to examine the effects of clutter, as measured by a
quantitative metric, on false alarm rate40. The authors used a measure of clutter called the SCR,
which  is  related  to  the  gray-level  statistical  variance  metric  (see  the  section  of  this  report  on 
clutter  and  conspicuity  for  details).  The  authors  looked  at  overall  probabilities  of  detection  and 
FAR  and  found  that  as  SCR  decreases  (i.e.,  as  clutter  increases),  observers  shift  their  criterion  to 
produce  more  “target  present”  responses,  thus  increasing  the  FAR. 
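
The effect of a criterion shift alone can be sketched with the equal-variance Gaussian model. The code below is an illustration under that textbook model and is not Doll and Schmieder's analysis; the function name and the numerical values are hypothetical.

```python
from statistics import NormalDist
from typing import Tuple

_phi = NormalDist().cdf   # standard normal CDF


def hit_and_fa_rates(d_prime: float, c: float) -> Tuple[float, float]:
    """Hit and false alarm rates for sensitivity d_prime and criterion c,
    with c measured from the midpoint between the two distributions."""
    hit = _phi(d_prime / 2.0 - c)
    fa = _phi(-d_prime / 2.0 - c)
    return hit, fa
```

Holding d' fixed at 1.5, for example, and moving c from 0.5 to -0.5 raises both the hit rate and the false alarm rate, which is the signature of a more liberal criterion rather than a change in sensitivity.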

39If normality is known to be violated or cannot be evaluated directly by normalized receiver operating
characteristic curves (see MacMillan & Creelman, 1991), then a non-parametric measure of sensitivity, A', may be
calculated (Pollack & Norman, 1964):

$$A' = \frac{1}{2} + \frac{(P_{hit} - P_{FA})(1 + P_{hit} - P_{FA})}{4 P_{hit} (1 - P_{FA})}$$

40The Doll and Schmieder (1993) paper contains a good introduction to SDT and how it applies to target
acquisition in cluttered environments. The paper also addresses the effects of display resolution and its interactions
with clutter.

Grossman, Hadar, Rehavi, and Rotman (1995) also used SDT to investigate how clutter affects
FAR. The authors defined noise to be the strength of a clutter metric (the probability of edge
metric or Schmieder & Weathersby’s [1983] SCR metric) and modeled search performance over
time as a function of per-glimpse SCR. Glimpses were assumed to be independent and attracted
to regions of high clutter. Their results indicated that the average accumulated number of false
alarms increased as a linear function of clutter, the slope of which was determined by the time
permitted for search. That is, the false alarm rate within each glimpse was constant. The
difference between their results and those of Doll and Schmieder was likely attributable to
assumptions made by Doll and Schmieder to predict overall FAR rather than examining FAR as
a function of search time. The hit rate (Pd) decreased as a function of clutter, indicating that
while subjects kept their criteria relatively constant, they were less sensitive to targets that
appeared in cluttered scenes. The authors also argue that time reduces the decision criterion.

Silk (1995a) argues that scene-based characteristics such as blur and clutter are not the only
determinants of observer decision threshold. In a study involving detection of altered IR target
signatures  (i.e.,  digitally  modified  to  reduce  the  signature),  observers  were  more  likely  to 
generate  false  alarms  in  a  test  situation  if  they  had  been  trained  in  a  target-rich  environment.  The 
hit rate of the observers was the same across training situations, indicating that observers shifted
their  decision  criteria  downward  when  they  thought  more  objects  in  the  scene  were  likely  to  be 
targets. 

In  addition  to  being  able  to  disentangle  the  effects  of  sensitivity  and  decision  criterion,  the  use  of 
methods  amenable  to  signal  detection  analysis  has  advantages  of  its  own.  First,  such  methods 
are  likely  to  be  standardized  across  studies,  so  researchers  may  be  better  able  to  relate  their 
theories  and  analyses  to  existing  data  rather  than  having  to  run  additional  studies.  Also,  forcing 
observers to perform a two-alternative forced choice (2AFC) or a detection-plus-confidence task
rather than a simple go/no-go detection task, or deliberately manipulating pay-offs for the different
types of errors (misses and false alarms), gives the experimenter additional information about the
nature  of  the  discrimination.  Valeton  and  Bijl  (1995)  found,  for  example,  that  subject 
performance  was  better  in  a  2AFC  task  (picking  which  of  two  trials  contained  a  target)  than  a 
go/no-go  task  (only  reporting  if  a  target  is  seen). 

A FAR-like measure, the FDP, has been used successfully within the framework of the
ACQUIRE model to explain the variability in N50 with scene clutter. FDP is defined as

$$FDP = \frac{\text{number of "present" responses when no target is present}}{\text{total number of "present" responses}} \times 100\%$$

Mazz (1998) noted that much of the variability observed in empirical N50 resulting from different
levels of clutter can be accounted for if one also accounts for the false detection percentage.

FDP is analogous to and largely independent of N50. That is, FDP and N50 can vary freely
within a study, indicating that both quantities should be taken into account when one is
performing an analysis of the effect of clutter41.
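
A small sketch of the FDP computation from response counts follows. It takes the definition to be the share of all “present” responses that were made when no target was present; the function and argument names are hypothetical.

```python
def false_detection_percentage(present_without_target: int,
                               present_total: int) -> float:
    """Percentage of all "present" responses that were false detections."""
    if present_total <= 0:
        raise ValueError("at least one 'present' response is required")
    if not 0 <= present_without_target <= present_total:
        raise ValueError("false detections cannot exceed total responses")
    return 100.0 * present_without_target / present_total
```

For example, 5 false detections among 20 “present” responses correspond to an FDP of 25%.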

41Though this result is not surprising given the independence of hit rate and false alarm rate in SDT, it is
interesting that such a result holds in the case of N50 and FDP in that both are ensemble performance measures.

Eye movement data from search tasks are an often-overlooked source of information about how
subjects perform target acquisition experiments. Eye movements during search can provide
insight into (a) the evaluation of local metrics of clutter, conspicuity, distinctness, and
attractiveness; (b) the evaluation of model parameters such as glimpse aperture and glimpse duration; and
(c) the determination of whether the classical or neoclassical search framework provides a better fit to
overt behavior. As has been mentioned elsewhere in this report, eye movements during search
tend toward regions of the scene that are “target like.” Several metrics have been proposed to
determine the attractiveness (or distinctiveness, or conspicuity) of various regions of the scene.
The evaluation of these metrics is almost always performed by an examination of the degree to
which eye movements about the scene tend to land on regions that score highly on a metric
(e.g., Tidhar et al., 1994; Rotman, Kowalski, & George, 1994; Toet, 1996; Cartier, Nicoll, &
Hsu, 1998). Also, the evaluation of search models that posit a fixation guidance mechanism
based on regions of the scene that are likely to contain the target (e.g., Doll et al., 1998) can be
aided by an evaluation of eye movements. By examining the spacing of eye movements and
how that spacing changes as a function of clutter, we can obtain information about the size of a
glimpse aperture, whether soft- or hard-shell search is occurring, and any effects of clutter on
glimpse parameters. Finally, looking at the degree to which fixations return to previously visited
regions of the scene and when during searching a decision is made can corroborate or disprove
predictions of the classical and neoclassical search models.

7.5  Validation  Issues 

O’Kane (1995) has written an excellent overview of the process of target acquisition model
development and validation. The author specifies and gives concrete examples of three
different methods and the roles they play in the process of model development: (a) perceptual
experiments using hybrid imagery, (b) perceptual experiments using calibrated field imagery,
and (c) field trials controlled and documented as well as possible. The discussion herein focuses
on the second of these three methods, as the models considered in this report were arguably past the
point of using hybrid imagery to test their underlying theories. At the same time, though, the
authors (wisely) chose not to incur the risk and expense required for field trials. If the field
imagery  is  calibrated  sufficiently  and  all  relevant  observer,  task,  and  dependent  variables  are 
recorded  in  detail,  then  much  can  be  learned  about  target  acquisition  without  our  leaving  the 
laboratory.  (Of  course,  field  experiments  will  be  required  to  validate  major  models,  especially  if 
the  models  predict  an  effect  of  a  variable,  such  as  observer  stress,  that  cannot  be  readily 
manipulated  in  a  laboratory  setting.) 

As alluded to earlier, evaluation of scene metrics and models of target acquisition performance
depends on the existence of a standardized data set of images, tasks, observer variables, and
performance measures. The generalizability of models is determined by the underlying
psychophysical data upon which the models are based. The validation of parts of models such
as ORACLE or GTV depends on a database of psychophysical results. Models of vision are
currently constrained by the lack of a readily available database of stimuli, methods, and
thresholds42.

42 A special interest group at the 1999 Annual Meeting of the Association for Research in Vision and
Ophthalmology (ARVO) called for the creation of such a database of thresholds and called for its availability on the
internet.

A useful data set for target acquisition model development and validation must contain four things:
(a) standardized, calibrated stimuli with complete descriptions of the scene geometry,
atmospheric conditions, and scene manipulations, (b) information about the task that observers
must perform, (c) observer variables and how they were measured, and (d) the performance
measures used and subject performance data.

Although creating such a database is a relatively simple task for basic vision science, since only the
optical spectrum, the unaided eye, and established psychophysical measures are of interest, doing so
for use in military target acquisition research represents a more daunting challenge. One reason
for the difficulty (and the need) is that different sensors have different specifications, any of
which may be important in the determination of observer behavior. In addition, observer tasks,
levels of target acquisition desired, and dependent measures (e.g., RT, Pd, FAR, eye movements)
will differ greatly. In order for us to grasp observer variables, much data about subject training,
levels of fatigue, concurrent task load, etc., must also be collected. The performance measures
should include ensemble and individual data, preferably with sufficient detail that different
analyses can be performed on the same data set (e.g., SDT analysis can be performed on data
from an ensemble-performance study). Individual data in ensemble studies are of particular
interest because, as pointed out by Rotman et al. (1989), the reason why ensemble performance
predictors such as P∞ have the values they do remains unknown.


8.  Prognostication:  The  Future  State-of-the-Art  Target  Acquisition  Model 


This final section describes the current state of the art and the author’s thinking about the most
profitable avenues to be pursued in target acquisition modeling.

There is no clear state-of-the-art target acquisition model. Some models do a good job of
predicting performance in general but do not incorporate many factors known to influence
performance (e.g., ACQUIRE, FLIR92). Other models incorporate many such factors but have
so  many  degrees  of  freedom  that  their  applicability  to  a  given  situation  may  be  questionable 
(e.g.,  ORACLE,  GTV).  Although  there  is  little  benefit  to  having  a  single  model  that  accounts  for 
everything  as  opposed  to  several  models  that  each  account  for  a  piece  of  the  target  acquisition 
pie,  there  is  undoubtedly  a  benefit  to  models  that  take  more  than  a  single  factor  into  account. 

The need for a multi-factor approach to target acquisition modeling comes from various lines of
evidence. First, studies by Mazz, Kistner, and Pibil (1998) and Meitzler, Kistner, et al. (1998)
demonstrated that variables such as scene clutter and target velocity, range, and
contrast affected performance both independently and through interactions. Second, many commonly
studied and validated metrics of clutter and conspicuity either are based on the co-occurrence matrix,
which incorporates structure as well as contrast in determining what parts of a scene are target
like, or are based on measures that take into account more than one scene factor at a time (e.g.,
CAMELEON). Third, from the perceptual psychology literature, it is known that contrast alone
does not determine salience or attentional capture. Rather, it interacts with factors such as
motion, transient visual events, and color.

The trick will be to incorporate the various factors in a way that makes sense and provides a good
analog to the cues that the human visual system uses to perform target acquisition. Meitzler,
Kistner, et al. (1998) and Meitzler, Singh, et al. (1998) used a fuzzy logic approach to incorporate
several factors into the ACQUIRE model. (Recall that this model is based on the Johnson criteria.)
The results of the study were sets of fuzzy rules, gleaned from half of a human performance data set
and applied to the other half, which predicted more than 90% of the variance in performance. This
result raises two questions: Can the rules from one such study be applied more generally to
other studies? Why did the rules arise the way they did? The first question is a practical matter
since it applies only to models within the ACQUIRE framework. The second question is more
interesting. What is it about the target acquisition situations in the study that prompted observers
to use some factors in one case and other factors in another?

This  reviewer  is  convinced  that  a  theoretically  driven  research  program  into  how  human 
observers  use  information  in  the  scene  will  allow  general  rules  to  be  derived  for  integrating 
multiple  factors  in  future  models.  The  starting  place  for  such  a  program  should  be  an  aspect  of 
visual  perception  that  is  well  understood  in  theory  and  has  been  shown  to  have  an  impact  on 
search  and  detection.  One  possibility  would  be  to  investigate  the  role  played  by  selective 
attention  in  real-world  target  acquisition  and  the  observer  and  scene-based  factors  that  influence 
the  deployment  of  attention.  A  team  at  ARL’s  Human  Research  and  Engineering  Directorate  is 
endeavoring  to  study  attention  in  just  such  a  way.  With  a  principled  understanding  of  the  role  of 
attention  and  the  influences  on  attention,  models  may  be  modified  or  developed  to  include 
known  effects  of  measurable  factors. 

Current  models  best  able  to  accommodate  the  effects  of  selective  attention  are  models  of 
individual  rather  than  ensemble  performance.  GTV,  in  particular,  already  contains  modules  to 
prioritize  and  guide  eye  movements  based  on  attention  and  to  include  training.  Incorporating 
attention  into  a  neoclassical  framework  model  would  require  a  non-random  search  step  that 
dramatically  complicates  calculations.  Fitting  attention  into  a  Johnson  criteria-based  model  also 
presents  somewhat  of  a  challenge,  since  there  are  so  few  free  parameters  to  work  with. 
(Presumably,  N50  or  the  shape  of  the  TTPF  may  be  modulated  by  attentional  parameters.) 

It is readily acknowledged that regardless of the emphasis placed on multi-factor approaches to
target acquisition modeling, the Johnson criteria and models based on them will not go away. It is
therefore  important  to  determine  the  extent  to  which  models  based  on  the  criteria  can  be 
extended  to  include  additional  factors.  NVESD’s  static  performance  models  have  undergone 
such  scrutiny  in  an  attempt  to  see  if  they  can  accommodate  multiple  observers,  multiple  targets, 
clutter,  false  detection  predictions,  and  the  presence  of  scene  obscurants.  Analyses  such  as  the 
one by Silk (1995b, 1997) should be emphasized before we attempt to encompass additional
variables in such models.

Appendix A. Models and Modeling Concepts of Interest


This  appendix  contains  descriptions  of  models  that  have  been  influential  or  discussed  in  detail  in 
the  report.  Models  are  discussed  in  terms  of  where  they  fall  on  the  five  classification  axes,  how 
they  function,  what  they  predict,  their  relations  to  the  topics  of  interest,  and  a  critique  of  their 
strengths  and  weaknesses. 

British Aerospace ORACLE Model (Overington, Brown, & Clare, 1977; Cooke, Stanley, &
Hinton, 1995)

CAUTIONARY NOTE: The ORACLE model is proprietary to British Aerospace and (so far as
this reviewer can determine) has never been published in toto. The model consists of several
modules for performing specific visual tasks such as motion, color, depth, etc. The following
description is for the general ORACLE framework and its application in search and
discrimination of achromatic, static, luminance-defined targets.

Basic  operating  principles: 

•  Focus  is  on  the  known  physiology/anatomy  of  visual  system,  primarily  the  optics  of 
the  eye  and  the  anatomy  of  the  retina. 

•  The  model  bases  its  predictions  on  retinal  image  of  elements  of  the  scene. 

•  Assumption:  Edges  of  a  target  rather  than  the  total  energy  within  it  are  significant. 

•  Threshold  detection  is  therefore  based  on  strength  of  signal  arising  from 
luminance  gradients  across  adjacent  retinal  receptors. 

•  Signal  strength  must  exceed  a  noise  term  for  a  decision  to  be  made. 

Flow  of  processing: 

• Mean scene luminance is used to determine the level of adaptation of the visual
system.

• Mean scene luminance and field of view determine pupil diameter.

• Pupil size and non-linear optical properties of eye structures determine the point spread
function and modulation transfer function of the eye's optics.

• The point spread function determines how a target image of a particular size and
luminance contrast (including edge gradient or sharpness) is represented as an image on the
retina.

• The sum of the activity of photoreceptors around the edge of the target constitutes the
signal.

• All of the above processes are based on known anatomical and psychophysical
properties of the visual system and include such factors as eccentricity, photopic and scotopic
acuity, the distribution of retinal receptors, and vertical/horizontal asymmetries in acuity.

How  search  is  characterized: 

•  Glimpse  duration  is  constant  (1/3  of  a  second). 

•  Glimpse  locations  are  independent.  (I.e.,  random  sampling  with  replacement.) 

•  Search  progresses  in  soft-shell  manner. 

•  Soft  shell  characteristics  are  modeled  as  a  distribution  of  population  (i.e.,  known)  hard 
shell  sizes. 

•  Clutter  causes  soft  shell  distribution  to  lean  more  towards  smaller  shells. 
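The search assumptions listed above (constant 1/3-second glimpses, independent glimpse locations, and a soft shell treated as a distribution of hard-shell sizes) can be sketched as follows. Because ORACLE has not been published in full, the per-glimpse probability used here (lobe area divided by search-field area) and the way the hard-shell population is averaged are assumptions made purely for illustration; all names and numbers are hypothetical.

```python
GLIMPSE_DURATION_S = 1.0 / 3.0  # constant glimpse duration stated in the text


def cumulative_detection(p_glimpse: float, n_glimpses: int) -> float:
    """Probability of at least one successful glimpse out of n independent ones.

    With random sampling with replacement this is 1 - (1 - p)^n.
    """
    if not 0.0 <= p_glimpse <= 1.0:
        raise ValueError("p_glimpse must be a probability")
    return 1.0 - (1.0 - p_glimpse) ** n_glimpses


def soft_shell_detection(lobe_areas, weights, field_area, n_glimpses):
    """Average hard-shell results over a population of lobe sizes.

    ASSUMPTION: a hard-shell lobe of a given area captures the target on one
    random glimpse with probability area / field_area (capped at 1).
    """
    total_weight = sum(weights)
    prob = 0.0
    for area, weight in zip(lobe_areas, weights):
        p_glimpse = min(1.0, area / field_area)
        prob += (weight / total_weight) * cumulative_detection(p_glimpse, n_glimpses)
    return prob


# Hypothetical lobe areas (deg^2) in a 100 deg^2 field, searched for 9 glimpses.
n = 9
search_time_s = n * GLIMPSE_DURATION_S
low_clutter = soft_shell_detection([4.0, 9.0, 16.0], [0.2, 0.3, 0.5], 100.0, n)
# Clutter is represented by shifting the weights toward the smaller shells.
high_clutter = soft_shell_detection([4.0, 9.0, 16.0], [0.5, 0.3, 0.2], 100.0, n)
```

In this sketch, clutter lowers the cumulative detection probability only by changing the lobe-size weights, which mirrors the point that ORACLE has no mechanism for guiding where the next glimpse lands.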

How  detection  is  characterized: 

• Detection is based on Ricco’s law (i.e., that the threshold contrast of a small target is
inversely proportional to its area).

How  recognition  is  characterized: 

•  Based  on  ability  to  resolve  detail  within  the  target  signature  (i.e.,  detectable  changes  in 
the  perimeter  of  the  target). 

•  Two  adjacent  features  must  be  resolvable  for  discrimination  to  be  possible. 

How  color  is  characterized: 

• NOTE: The available documentation on the model did not go into detail, although the
authors do assert that ORACLE’s predictions related to color conspicuity are accurate
(see below).

• Color in ORACLE is based on R and G cones only, using cone response sensitivity
data.

• The ORACLE framework can be used to calculate the response to Johnson-like bar patterns by
determining the point spread of the constituent Fourier components (odd harmonics) of the bar pattern’s
square waveform.

•  Four  spatial  scales  (analogous  to  responses  from  sets  of  single,  3,  9,  and  27  adjacent 
photoreceptors)  are  incorporated  into  the  model  in  order  to  accommodate  psychophysical  results 
related  to  the  overall  contrast  sensitivity  function  of  the  eye. 

NOTES: 

•  Motion  not  included  in  model. 

•  Model  has  not  been  validated  in  general  -  only  piecewise  agreement  with 
psychophysics. 

•  There  is  no  sufficient  database  with  which  to  validate  the  model  (authors). 

•  More  interested  in  optics  of  the  eye  than  other  models. 

• Targets are assumed to be larger than 9 arc min in diameter.

• The model assumes that edges are important, yet it is able to model Ricco’s Law for
small targets. This is only possible for targets that are not elongated; otherwise the edge-based
signal might be stronger than the area-based signal. Possibly small targets are blurred so that this
is not a problem?

• ORACLE's visibility (target SNR) was compared to human ratings of conspicuity of
colored targets against colored backgrounds (Johnson, 1990). The model agreed well with
human data. However, no specifics were given as to how color was included in the model tested.

CRITIQUE: 

•  Model  suffers  from  lack  of  fixation  guidance  mechanism.  Effects  of  clutter  only  alter 
attributes  of  soft  shell  lobe. 

• Top-down characteristics are not implemented in the model because they govern how
search will progress through a particular scene. The authors argue that “...the effort in modeling
at an equivalent level of detail is far greater than the reward for many practical situations.” As a
result, they select lobe sizes that will produce experimentally measured cumulative search
distributions over time.

• Models such as this one are likely to be less accurate when the stimuli upon which they operate
approach the limits of the psychophysical measurements upon which the model is based.
Overington (1982) pointed out that models based on psychophysics have specific “envelopes
of usage” where their predictions are accurate. Outside such envelopes, error propagates from
step to step in calculation, resulting in degradation in overall performance.

• Looming targets (targets that approach the observer along their line of sight) are
modeled as an increase in size and apparent contrast only. Such a characterization is inadequate
to model the phenomenon of looming.

• Perceptual learning is not included in the model, so performance cannot improve with
practice.

• It is unclear how practice effects could be included since the model’s
psychophysical basis does not include data for trained versus untrained observers.

Georgia Tech Vision (GTV/VISEO) Model (Doll, McWhorter, Schmieder, & Wasilewski,
1995; Doll, McWhorter, Wasilewski, & Schmieder, 1998)


optical/objective | - X - - | cognitive/subjective
reductive | - X - - | comprehensive
target-centered | - X - - | situation-centered
physiological | - X - - | empirical
individual |-X - - | ensemble

GTV is the general purpose vision model produced by Georgia Tech. The military target
acquisition model VISEO (Doll et al., 1997) incorporates GTV into a number of processing
modules.

Basic  operating  principles: 

•  Focus  is  on  psychophysics  and  multi-channel  SF  modeling. 

•  Decisions  are  based  on  the  object-by-object  point  probability  of  being  fixated  and  that 
a  fixated  object  will  be  judged  a  target. 

•  SNR  and  clutter  are  incorporated. 

•  Conspicuities  of  objects  in  the  scene  determine  the  probability  that  they  will  be  fixated. 

•  Clutter  and  training  are  involved  in  determining  these. 

•  Pre-attentive  scene  segregation  into  “blobs”  (i.e.,  target-like  regions)  is  based  on 
texture  segmentation. 

•  GTV  models  visual  system  as  output  of  many  (56)  oriented  spatial  frequency-selective 
channels,  the  output  of  which  undergoes  various  operations.  The  goal  of  the  model  is  to 
intelligently  combine  information  from  channel  output  of  early  vision  so  that  targets  can  be 
distinguished  from  clutter. 

• GTV includes a simulation of the optics of the eye as well as retinal and cortical
V1, V4 (color), and MT (motion) visual processing areas.

• GTV consists of a pre-processor that takes a scene or display and converts it into a map
of one rod plus three cone outputs, followed by five processing stages: a “front end,” pre-attentive
and attentive modules that run in parallel, a selective attention/training module, and a
performance module.

Each  stage,  in  more  detail: 

Stage  1  -  Front  End 

• Luminance: concerns receptor pigment bleaching, pupil dilation, receptor thresholds/
bleaching, flicker, and transient luminance changes

•  Color:  converts  the  pre-processed  image  from  short,  medium,  long  wavelength 
receptor  activity  to  R/G  and  B/Y  color  opponent  pairs  and  cone  luminance  signal 

•  ->  output  to  pre-attention  and  attention  stages  in  parallel 

Stage  2  -  Pre-attention  module  (search  information) 

• Performs calculations of conspicuity for peripheral vision.

• Motion: temporal filtering extracts local motion signals; temporal integration adds blur
to high spatial frequencies of the image

• Filtered separately from spatial information, similarly to human V4/MT
processing (Livingstone & Hubel, 1987)

• Motion processing based only on scotopic (rod) and mean photopic (cone
luminance) information, not chromatic information.

•  Pattern  perception  unit  in  module  decomposes  image  into  oriented  SF  channels 

•  Number  of  channels  depends  on  source  (though  always  four  orientations):  two 
from  rods,  12  from  each  color  opponency,  24  from  cone  luminance. 

•  Interactions  between  rectified,  filtered  channels  are  simulated. 

•  Texture  information  extracted. 

•  ->  output  to  selective  attention  module 

Stage  3  -  Attentional  module  (detection  discrimination  information) 

•  Performs  similar  calculations  to  stage  2,  only  now  for  foveal  feature  extraction,  i.e., 
different  acuities,  color  and  motion  sensitivities,  etc. 

•  ->  output  to  selective  attention  module 

Stage 4 - Selective attention module (assignment of Pfix, Pyes|fix; training)

•  Uses  weighted  pre-attentive  output  to  segment  scene  into  objects. 

•  Uses  neural  network  to  set  weights.  Weights  for  discriminant  function  attempt  to 
distinguish  between  target  and  background  pixels.  The  neural  network  uses  training  to  set  up 
this  discriminant  function. 

• -> output of pre-attentional operations is a set of blobs representing potential targets.

• Uses weighted attentive output to segment foveal scene into objects.

• -> output of same processes as on pre-attentive information (only now using filters
tuned for the fovea) is a map containing target-like foveal objects.

• -> output to Performance Module

•  NOTE:  The  neural  network  that  sets  the  weights  of  pre-attentive  and  attentive 
representation  features  that  are  to  be  stressed  must  be  trained  before  GTV  is  run. 

Stage 5 - Performance module

•  Performance  module  computes  measures  of  search  and  discrimination  performance 
based  on  output  of  selective  attention  module: 

•  Calculation  of  Pd,  P(FA),  d',  and  RT: 

• The model simulates an observer selecting fixation locations by means of a
noisy decision process based on conspicuity. Conspicuity is a function of the pre-attentional Pfix
calculation, noise, clutter, and the spacing of objects:

• quantum noise, neural noise, and clutter (defined as “extent to which a
clutter blob’s luminance, texture, chromatic information, and temporal contrast match the
target”)

• Spacing of target blob with respect to clutter blobs also influences
conspicuity (consistent with Duncan & Humphreys, 1989).

• At each location, the signal-to-clutter ratio (SCR) is calculated:

• The fixated object signal is based on the pooled attentive output
summed over the blob area.

• SCR = (signal - average clutter blob signal) / (standard deviation of
clutter blob signals)

• Appealing to signal detection theory, the SCR is equated to d'. Thus, Pyes|fix and
P(FA) can be calculated once a decision criterion has been assumed or measured. Pd = Pfix x
Pyes|fix

• Once Pyes|fix is known, the model can determine how many glimpses are
required before a decision is rendered. (The model assumes that fixations are selected from high
Pfix locations without replacement.) Given a constant glimpse duration, RT can be calculated.
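The chain of Stage 5 calculations described above (signal-to-clutter ratio, its treatment as d', and the product of fixation and decision probabilities) can be illustrated with a short Python sketch. It assumes the equal-variance Gaussian form of signal detection theory, a criterion expressed in clutter standard deviations above the mean clutter signal, and the population standard deviation; none of these details is confirmed by the GTV papers, and every name and number is hypothetical.

```python
from statistics import NormalDist, mean, pstdev

_STD_NORMAL = NormalDist()  # standard normal distribution


def signal_to_clutter_ratio(target_signal, clutter_signals):
    """SCR = (signal - mean clutter blob signal) / SD of clutter blob signals.

    ASSUMPTION: population SD (pstdev). Raises ZeroDivisionError if all
    clutter blob signals are identical.
    """
    return (target_signal - mean(clutter_signals)) / pstdev(clutter_signals)


def p_yes_given_fix(d_prime, criterion):
    """P(yes | fixated target) for a criterion placed on the decision axis."""
    return _STD_NORMAL.cdf(d_prime - criterion)


def p_false_alarm(criterion):
    """P(yes | fixated clutter blob) for the same criterion."""
    return _STD_NORMAL.cdf(-criterion)


def p_detect(p_fix, p_yes):
    """Pd = Pfix x Pyes|fix, as stated in the text."""
    return p_fix * p_yes


# Hypothetical pooled attentive outputs for one target blob and two clutter blobs.
scr = signal_to_clutter_ratio(4.0, [1.0, 3.0])   # treated as d'
p_yes = p_yes_given_fix(scr, criterion=1.0)
p_fa = p_false_alarm(criterion=1.0)
pd = p_detect(p_fix=0.5, p_yes=p_yes)
```

Raising the criterion lowers both Pyes|fix and P(FA), which is why the text notes that a decision criterion must be assumed or measured before either can be computed.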

Model  predictions: 

• Probability that a “blob” is fixated on glimpse i: Pfix(i)

•  Probability  that  a  blob,  once  fixated,  is  determined  to  be  a  target:  Pyes|fix(i) 

•  Pd  (given  a  criterion  for  decision  making  according  to  SNR) 

•  RT,  based  on  number  of  glimpses  required  to  make  judgment. 

NOTES: 

•  Motion  contributes  to  conspicuity  and  causes  blur  before  SF  analysis. 

•  Masking  (interactions  between  SF  channels)  is  implemented  in  the  model. 

•  Channels  are  therefore  not  independent  (see  Olzak  &  Thomas,  1992,  for  a 
discussion  of  such  models) 

•  Glimpse  duration  assumed  to  be  a  constant  1/3  second.  That  is,  all  glimpses  during 
search  are  exactly  333  ms. 

• Motion can but does not necessarily increase the conspicuity of a moving object.

CRITIQUE:

• The calculation of all foveal features (by attention module) at the same time is not
physiologically realistic. The model would be more realistic and behave identically if it were to
calculate the foveal features only after a blob has been selected by the performance module.
(This behavior would take into account the unbound feature nature of pre-attentive and post-attentive
vision described by Wolfe & Bennett, 1997.)

•  Incorporation  of  pre-attentive  stage  to  drive  eye  movements  is  a  good  idea. 

• Training at both pre-attentive and attentive levels is also a good idea.

•  Eye  movement  assumptions  (i.e.,  selection  without  replacement)  are  unrealistic  (e.g., 
Horowitz  &  Wolfe,  1998;  Nicoll  &  Hsu,  1995). 

• Although the model can in theory handle target-absent trials (i.e., no response is made
if every pre-attentive blob is investigated and none has sufficient signal strength to trigger a
“yes” response), it can only do so if serial self-terminating search processes are assumed. Such
an assumption, that an “absent” judgment is made after each potential target has been investigated
only once, does not appear to hold (Chun & Wolfe, 1996). Observers tend to over-search and
are hesitant to report the absence of a target.

• Attention can only refer to the presence of features, not their absence. As such, a less
green object among more green objects should be quite inconspicuous, though in reality it may
be quite conspicuous (although the search asymmetry literature indicates that it would not be as
conspicuous as the obverse [e.g., Wolfe, 1994b]).

•  Training  issues: 

• Training’s effect is entirely based on automaticity (Schneider, Dumais, &
Shiffrin, 1984). That is, with training, any combination of features can cause an increase in
conspicuity. In reality, some conjunctions of features cannot be learned by humans (such as
orientation-color combinations). This lack of constraint on what can be learned manifests itself
as the model’s out-performance of humans and the need to add noise to make it behave more like
a human observer (Doll et al., 1998).

• It is unclear to what degree the training is generalizable to slightly different
targets or to what degree more than a single target can be trained at a time (as is possible with many
neural net-based representations). These possibilities were addressed in neither paper.

•  Problem  with  motion  implementation: 

•  Because  motion  information  is  scalar  (only  related  to  speed,  not  direction),  the 
model’s  attention  mechanism  has  no  direction  selectivity,  which  the  human  visual  system  does. 

•  Therefore,  GTV  can  only  distinguish  between  speeds.  This  does  not  allow  the 
system  to  extract  information  about  motion  parallax  and  how  a  moving  target’s  violation  of 
parallax  is  plainly  visible. 

• Foveation is required for detection! Even though the model is ostensibly based on the
conspicuity of targets, highly conspicuous targets must still be fixated for the model to produce a
“yes” response. This result is inconsistent with pop-out (e.g., Yantis & Egeth, 1999).

Itti and Koch (2000) Saliency-Based Attention and Fixation Selection Model


optical/objective | - X - - | cognitive/subjective
reductive | - X - - | comprehensive
target-centered |-X - - | situation-centered
physiological |-X - - | empirical
individual |-X - - | ensemble

The  Itti  and  Koch  (2000)  model  is  purely  bottom-up  in  nature,  though  it  was  modified  with 
limited  success  by  Turano  et  al.  (2003)  to  incorporate  crude  representations  of  target  features  and 
target location. The model is based heavily on known aspects of human, primate, and
mammalian (feline, primarily) visual psychophysics, neuroanatomy, and electrophysiology. The
model encodes the visual scene along three feature dimensions (luminance intensity, orientation,
and opponent-pair color contrast) at multiple scales. Activation within each feature dimension is
used to create a conspicuity map for that feature. These three conspicuity maps are then
combined into a single saliency map. The model defines the next fixation location as that
corresponding to the point of maximum activation in the saliency map. Inhibition of return is
invoked as a temporary inhibition of this location in the saliency map to prevent immediate
re-fixation of the same location in the scene.
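The fixation-selection loop described above (combine conspicuity maps, fixate the maximum, inhibit that location) can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the published implementation: the maps are combined by a plain average rather than the model's normalization scheme, the multi-scale feature extraction is omitted, and inhibition lasts for the whole scanpath instead of decaying. All function names and map values are hypothetical.

```python
def combine_conspicuity(intensity, orientation, color):
    """Average three equally sized conspicuity maps into one saliency map."""
    return [
        [(i + o + c) / 3.0 for i, o, c in zip(row_i, row_o, row_c)]
        for row_i, row_o, row_c in zip(intensity, orientation, color)
    ]


def most_salient(saliency):
    """Return (row, col) of the maximum value; ties go to the first one found."""
    best_loc, best_val = (0, 0), saliency[0][0]
    for r, row in enumerate(saliency):
        for c, value in enumerate(row):
            if value > best_val:
                best_loc, best_val = (r, c), value
    return best_loc


def scanpath(saliency, n_fixations, radius=1):
    """Winner-take-all fixation sequence with a crude inhibition of return."""
    work = [row[:] for row in saliency]  # copy so the input map is untouched
    path = []
    for _ in range(n_fixations):
        r0, c0 = most_salient(work)
        path.append((r0, c0))
        # Suppress the fixated neighbourhood so it is not immediately re-selected.
        for r in range(max(0, r0 - radius), min(len(work), r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(len(work[0]), c0 + radius + 1)):
                work[r][c] = float("-inf")
    return path


# Hypothetical 3 x 3 conspicuity maps with a strong and a weaker peak.
feature_map = [[0.9, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.6]]
saliency = combine_conspicuity(feature_map, feature_map, feature_map)
fixations = scanpath(saliency, n_fixations=2)
```

Because nothing in the loop refers to what the observer is looking for, the sketch also makes the purely bottom-up character of the model concrete.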


CRITIQUE: 

• The model does not incorporate transient visual events (flashes, motion, etc.) into
its calculation of saliency, even though those events have been demonstrated to capture visual
attention (e.g., Yantis, 1996).

• The model’s performance for simple stimuli such as oriented and colored line
segments is a good match for human performance. However, although the model’s behavior in
real-world scenes seems subjectively reasonable and it actually locates targets far faster than
a random fixation generator would, results from Itti and Koch (2000) indicate that it does a poor
job of predicting the response times of human observers detecting targets. Further results from
Turano et al. (2003), though coded differently for fixation location, indicate that the Itti and
Koch (2000) model performed at chance levels during a real-world mobility task.
Guided Search (Wolfe, Cave, & Franzel, 1989; Wolfe, 1994b; Wolfe & Gancarz, 1996)


optical/objective  |--X---------------|  cognitive/subjective
reductive          |--X---------------|  comprehensive
target-centered    |--X---------------|  situation-centered
physiological      |--X---------------|  empirical
individual         |--X---------------|  ensemble

Wolfe  and  colleagues’  Guided  Search  models  integrate  stimulus-driven  (bottom-up)  and  goal- 
directed  (top-down)  mechanisms  in  the  deployment  of  attention  (and  eye  movements  in  Wolfe  & 
Gancarz,  1996)  about  a  scene.  That  is,  the  models  all  incorporate  observer  knowledge  of  target 
attributes  and  guide  attention  to  objects  in  the  scene  that  have  those  attributes.  (Note  that  these 
models  are  based  on  simple  objects  such  as  oriented,  colored  line  segments  with  simple, 
separable  features.  They  were  not  intended  to  be  applied  in  their  present  form  to  real-world 
target  acquisition  situations.  Wolfe  [1994a]  did,  however,  apply  the  guided  search  framework  to 
“naturalistic”  stimuli  with  some  success.) 


Pre-attentive  system  attributes 

•  Operates  in  parallel  across  visual  scene. 

•  Creates  a  map  of  features  present  at  various  locations  in  the  scene 

•  Features:  orientation,  size,  color,  luminance,  motion,  depth 

•  One  spatial  map  per  feature,  with  activation  level  indicating  feature  presence. 

• Activation level is a function of the feature itself, of the difference within the feature
dimension from neighboring items (dissimilar neighbors -> higher activation than similar
neighbors), and of the distance between items (close -> higher activation than far). The
pre-attentive system therefore calculates both the feature loadings of items and the distinctness
of items.

• (This incorporates findings of Duncan & Humphreys, 1989, and Nothdurft, 1991.)

•  Noise  is  added  to  feature  map  locations. 

Attentional  system  attributes 

•  Top-down  feature  maps  contain  information  about  features  present  in  the  target. 

•  Features  that  are  unique  are  given  highest  weight. 

Combination  of  bottom-up  and  top-down  activations: 

•  A  master  activation  map  is  created.  High  activations  result  from  locations  that  weigh 
heavily  on  several  feature  maps  from  top-down  and  bottom-up  processing. 

Search  and  detection: 

•  Search  progresses  in  order  from  highest  activation  location  to  lowest. 

•  IOR  is  implemented  so  that  once  a  location  is  searched,  it  is  not  searched  again.  (This 
is  an  unrealistic  assumption.  See  Nicoll  &  Hsu,  1994,  for  data  contradicting  this.) 

• Search progresses until (1) a target is found, (2) a specific period of time has passed
without finding a target, or (3) activations are judged by the observer to be too low to be targets.

• Target detection is based on signal detection theory (SDT): If activation of bottom-up maps is
greater than a decision criterion, a detection is rendered.

• Situation (1) is a hit or a false alarm; (2) or (3) may create misses or correct
rejections.

•  False  alarms  are  a  result  of  the  decision  criterion  being  shifted  downward  after 
a  miss  (assuming  subjects  are  given  feedback).  False  alarms,  in  turn,  shift  the  criterion  upward. 
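
The search-and-detection logic outlined above can be sketched as follows. This is a simplified illustration under stated assumptions, not Wolfe's implementation: the master activation is taken as a plain sum of bottom-up and top-down activations per item, Gaussian noise stands in for feature-map noise, an attended item is assumed to be classified perfectly, and the time limit is expressed as a maximum number of items visited. The function and parameter names are hypothetical.

```python
import numpy as np

def guided_search(bottom_up, top_down, is_target, criterion, max_items,
                  noise_sd=0.0, seed=0):
    """Visit items from highest to lowest activation, each at most once.

    Returns ("found", n) or ("absent", n), where n is the number of items
    attended before the search ended.
    """
    rng = np.random.default_rng(seed)
    bottom_up = np.asarray(bottom_up, dtype=float)
    top_down = np.asarray(top_down, dtype=float)
    noise = rng.normal(0.0, noise_sd, size=bottom_up.shape)
    activation = bottom_up + top_down + noise      # master activation map
    order = np.argsort(-activation, kind="stable")  # highest activation first
    visited = 0
    for idx in order[:max_items]:                  # termination rule (2)
        if activation[idx] < criterion:            # termination rule (3)
            return "absent", visited
        visited += 1
        if is_target[idx]:                         # termination rule (1)
            return "found", visited
    return "absent", visited
```

Iterating over a sorted list once is what implements the strict form of inhibition of return criticized above: no item can ever be revisited.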

CRITIQUE: 

• Model makes specific, testable predictions about search performance in simple tasks.

•  Guided  Search  is  useful  in  that  it  assumes  (probably  correctly)  that  covert  shifts  of 
attention  (i.e.,  attentional  movement  without  subsequent  eye  movement)  and  eye  movements  are 
determined  in  large  part  by  a  parallel  pre-attentional  system. 

• Features are weighted so that search asymmetries, pop-out, and top-down
attentional control settings are accounted for.

•  Incorporation  of  several  results  from  search  literature  (e.g.,  pop-out  for  feature 
singletons,  similarity  and  proximity  effects). 

•  The  assumption  of  serial  self-terminating  search  is  almost  certainly  incorrect. 

•  The  generation  of  errors  in  the  model  is  problematic  and  seems  almost  atheoretical. 

•  Inclusion  of  the  mechanism  to  generate  errors  does  create  a  reasonable  looking 
speed-accuracy  trade-off. 

•  Cannot  be  extended  to  situations  in  which  features  are  not  clearly  delineated. 

•  Does  not  work  for  continuous,  naturalistic  stimuli  such  as  textures. 

•  Does  not  have  a  mechanism  to  perform  a  difficult  detection  or  any  kind  of 
discrimination  task. 

• Inclusion of IOR is interesting, though it is unclear what role IOR actually plays in
search. See section of report on assumptions of neoclassical search framework. Memoryless
search does not permit IOR to occur.
Human Spatial Vision Model (Wilson, 1991) Filters for spatial vision

The filters that determine spatial vision:

Mechanism  Basis      Number of     Weighted   Number of  Filter Contrast
           Frequency  Orientations  Locations  Filters    Sensitivity
A          0.9 cpd     6              6            36      30.0
B          1.7 cpd     7             36           252      70.0
C          2.8 cpd     8             49           392     140
D          4.0 cpd     9            100           900     150
E          8.0 cpd    11            256         2,816      76.7
F          16 cpd     12            961        11,532      18.4

Total number of filters: 15,928

cpd = cycles per degree
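
As a consistency check on the table, the number of filters for each mechanism appears to be the number of orientations multiplied by the number of weighted locations, and those products sum to the stated total. The short sketch below works under that assumption (for mechanism F it gives 12 × 961 = 11,532); the variable names are illustrative only.

```python
# (mechanism, number of orientations, weighted locations), taken from the table
mechanisms = [
    ("A", 6, 6), ("B", 7, 36), ("C", 8, 49),
    ("D", 9, 100), ("E", 11, 256), ("F", 12, 961),
]

# Assumed relationship: filters = orientations x weighted locations
filters = {name: n_orient * n_loc for name, n_orient, n_loc in mechanisms}
total = sum(filters.values())
print(filters, total)
```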
Johnson's (1958) bar pattern equivalence study and the Johnson Criteria


optical/objective  |--X---------------|  cognitive/subjective
reductive          |--X---------------|  comprehensive
target-centered    |------------------|  situation-centered
physiological      |---------------X--|  empirical
individual         |----------------X-|  ensemble

Basics: 

•  Static  performance  model 

•  Stationary  targets 

•  Achromatic 

•  Uniform  background 

Johnson  attempted  to  establish  a  relationship  between  the  number  of  lines  resolvable  on  a  target 
through  an  imaging  device  and  the  degree  to  which  that  target  could  be  acquired.  Subjects 
viewed scale models of eight vehicles and a Soldier through an image intensifier (I²) device and were asked to
(1)  detect,  (2)  determine  the  orientation  of,  (3)  recognize,  or  (4)  identify  the  target.  (The  level  of 
discrimination  in  task  (2)  is  referred  to  as  classification.) 

Bar  charts  of  the  same  contrast  and  scale  as  the  target  models  were  also  displayed  to  subjects.  At 
each  scale  and  contrast,  Johnson  desired  to  know  how  many  cycles  were  resolvable.  The 
maximum  number  of  resolvable  bar  cycles  across  the  target’s  critical  dimension  was  determined 
for  each  task: 

\[ N = H_{targ} \cdot f_x \]

in which N = number of cycles resolvable across target critical dimension,

H_targ = critical dimension of the target, and

f_x = highest bar pattern spatial frequency (fundamental frequency of bar).

Johnson found that so long as the contrasts of the bar (light versus dark bands) and target (target
versus background) were equal, the number of cycles on target was independent of
both target contrast and scene luminance. In other words, the ability of an observer to perform a
discrimination task was related solely to the observer's ability to resolve bar patterns. The following table
lists the average number of cycles required for an ensemble of observers to acquire various
military targets with 50% accuracy, defined as Pd = P(detect | present):


Resolution (cycles) across critical dimension to perform 50% accurate acquisition at a level of:

Detection    Orientation (classification)    Recognition    Identification
1.0 ± 0.25   1.4 ± 0.35                      4.0 ± 0.8      6.4 ± 1.5

This number of cycles for 50% accurate ensemble performance is referred to as N50.

The shape of the psychometric function relating ensemble accuracy, Pd, to the number of
resolvable cycles on target, N, is known as the target transform probability function (TTPF), and
has been empirically determined to be:

\[ P_d = \frac{(N/N_{50})^{E}}{1 + (N/N_{50})^{E}} \]

in which

\[ E = 2.7 + 0.7\,(N/N_{50}) \]
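
As a worked illustration of the TTPF (reading the function as the ratio of (N/N50)^E to 1 + (N/N50)^E, with E = 2.7 + 0.7(N/N50)), consider a target with exactly N50 resolvable cycles and one with twice that number. The second case is a hypothetical example chosen only to show how steeply the function rises.

```latex
\[
N = N_{50}:\qquad E = 2.7 + 0.7(1) = 3.4, \qquad
P_d = \frac{1^{3.4}}{1 + 1^{3.4}} = 0.5
\]
\[
N = 2N_{50}:\qquad E = 2.7 + 0.7(2) = 4.1, \qquad
P_d = \frac{2^{4.1}}{1 + 2^{4.1}} \approx \frac{17.15}{18.15} \approx 0.94
\]
```

The first line confirms that N50 is, by construction, the 50% point of the function.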
NVESD models: ACQUIRE (Tomkinson, 1990) and FLIR92 (Scott & D'Agostino, 1992)


optical/objective  |--X---------------|  cognitive/subjective
reductive          |--X---------------|  comprehensive
target-centered    |--X---------------|  situation-centered
physiological      |---------------X--|  empirical
individual         |---------------X--|  ensemble

FLIR92  is  based  on  the  Bailey  (1970)  framework  of  separable  detection  and  discrimination 
stages  in  search.  ACQUIRE  represents  the  non-time-dependent  discrimination  stage  of  FLIR92 
and  does  not  incorporate  search.  Both  models  use  the  Johnson  (1958)  bar  pattern  equivalence 
metric  for  target  discriminability  using  electro-optical  devices.  ACQUIRE  in  particular  is 
designed  to  predict  the  range  at  which  a  known  target  can  be  acquired  by  an  ensemble  of 
observers. 

Pure  detection  in  ACQUIRE  proceeds  as  follows: 

1. The area of a rectangle with the same width and height as the target is calculated. Call it A.

2. The mean temperature difference between the target and its immediate background, ΔT,
is calculated.

3. The number of resolvable cycles on the target, N, is calculated as

\[ N = \frac{\sqrt{A}}{R}\, f_r \]

in which R = the target range, and

f_r = the maximum spatial frequency resolvable from the minimum resolvable
temperature difference (MRTD) curve defined for the sensor and atmosphere

4. The ensemble probability of acquisition is then calculated as:

\[ P_d = \frac{(N/N_{50})^{E}}{1 + (N/N_{50})^{E}} \]

in which E = 2.7 + 0.7(N/N50), and

N50 = number of cycles resolvable for 50% ensemble acquisition at the desired
level of acquisition (i.e., detection, recognition, identification)


NOTES: 

•  ACQUIRE  is  sensitive  to  clutter  in  that  N50  increases  as  level  of  clutter  increases 

•  ACQUIRE  is  not  able  to  handle  motion  effectively,  though  attempts  to  do  so  are  under 
way  (e.g.,  Mazz,  Kistner,  &  Pibil,  1998) 
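
The four ACQUIRE steps above can be sketched in code. This is an illustrative sketch only: it assumes that the target's critical dimension is the square root of the rectangle area A, that range is in kilometers and f_r is in cycles per milliradian (so that meters divided by kilometers gives milliradians), and that f_r has already been read from the MRTD curve for the target's ΔT. The function names and units are assumptions, not part of ACQUIRE.

```python
import math

def ttpf(n, n50):
    """Target transform probability function: ensemble P_d for n resolvable cycles."""
    ratio = n / n50
    e = 2.7 + 0.7 * ratio
    return ratio ** e / (1.0 + ratio ** e)

def acquire_probability(width_m, height_m, range_km, f_r_cyc_per_mrad, n50):
    """Static (time-independent) ensemble probability of acquisition."""
    area = width_m * height_m                 # step 1: bounding-rectangle area A
    critical_dim = math.sqrt(area)            # assumed critical dimension, sqrt(A)
    angular_size_mrad = critical_dim / range_km
    n = angular_size_mrad * f_r_cyc_per_mrad  # step 3: resolvable cycles on target
    return ttpf(n, n50)                       # step 4: ensemble probability
```

Under these assumptions, halving the range doubles N for a fixed f_r, so the predicted probability rises steeply as the target closes, which is how the model yields an acquisition range for a known target.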
FLIR92 adds a front end search process before the ACQUIRE discrimination stage. As
mentioned  before,  the  model  takes  advantage  of  the  following  assumptions: 

•  Glimpse  duration  is  constant  (at  around  0.3  second). 

•  Glimpse  location  is  random  with  replacement. 

•  Each  glimpse  has  an  equal  probability  of  locating  the  target. 

•  Asymptotic  performance  (given  infinite  time)  will  not  be  perfect;  rather,  it  will 
converge  on  the  predictions  of  ACQUIRE. 

The  model  uses  these  assumptions  to  achieve  the  following  performance  prediction  as  a  function 
of  time: 

\[ P_d(t) = P_\infty \left(1 - e^{-t/\tau_{FOV}}\right) \]

in which $P_\infty$ = Pd, above (asymptotic performance given an infinite search time), and

$\tau_{FOV}$ = the mean time to detect the target (equals average glimpse time divided by
the probability of locating the target in a single glimpse)

. . .the average target detection rate, $1/\tau_{FOV}$, is related to target detail available and required for
acquisition within a field-of-view search:

\[ \frac{1}{\tau_{FOV}} = \frac{1}{6.8}\,\frac{N}{N_{50}} \]
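
The time-dependent prediction can be sketched as follows, reading the search equation as an exponential approach to the static ACQUIRE probability with detection rate N/(6.8 N50). The function and argument names are hypothetical, and the time unit is assumed to be the one in which the constant 6.8 is expressed (presumably seconds).

```python
import math

def search_probability(t, p_inf, n, n50):
    """Probability of acquisition after searching for time t.

    p_inf is the asymptotic (infinite-time) probability from ACQUIRE;
    n and n50 are resolvable cycles on target and the 50% criterion.
    """
    rate = n / (6.8 * n50)          # 1 / tau_FOV, the mean detection rate
    return p_inf * (1.0 - math.exp(-t * rate))
```

For example, with N = N50 the mean time to detect is 6.8 time units, at which point the prediction has reached about 63% of its asymptote.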




Recognition  by  Components  (RBC)  Theory  (Biederman,  1987) 


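[Rating scales for this model (optical/objective to cognitive/subjective, reductive to comprehensive, target-centered to situation-centered, physiological to empirical, individual to ensemble): marker positions not recoverable.]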

The  basic  idea  behind  RBC  theory  is  that  through  the  extraction  of  so-called  “non-accidental” 
properties  of  a  2-D  retinal  image,  a  3-D  representation  of  the  object  can  be  formed.  The 
representation consists of a selection of basic geometric forms called “geons.” The
representation  of  the  object  is  then  compared  to  internal  representations  of  known  objects. 
Recognition  occurs  when  the  geon  representations  match.  The  internal  representations, 
according  to  the  model,  are  scale  and  viewpoint  invariant  in  that,  so  long  as  an  object  can  be 
broken  into  sufficient  geons  in  a  well-defined  spatial  relationship  with  one  another  to  produce  a 
representation  matching  an  internal  representation,  the  location  in  space,  size,  and  viewing  angle 
of  the  object  are  unimportant. 

The  primary  non-accidental  properties  of  an  image  are  based  on  regions  of  deep  concavity, 
which  correspond  highly  with  Marr  and  Hildreth’s  (1980)  concept  of  zero  crossings  in  a 
difference-of-Gaussian-processed  image.  (In  a  neural  network  model  instantiating  RBC  theory, 
Hummel & Biederman, 1992, used a DOG or DOOG (difference of offset Gaussians) operator to
extract  edges  at  an  early  stage  of  processing.)  The  edges,  however  extracted,  hint  at  3-D  surface 
characteristics  by  means  of  Gestalt-like  principles  such  as  grouping,  symmetry,  and  similarity, 
and  by  the  interpretation  of  T-  and  L-junctions.  These  principles  and  properties  are  used  to 
generate  inferences  about  the  underlying  geon  structure  of  the  object  that  precipitated  the  retinal 
image.  A  key  feature  of  the  theory  is  the  idea  that  not  all  parts  of  an  edge  drawing  of  an  object 
are  necessary  for  the  extraction  of  the  object’s  shape.  Rather,  it  is  the  “cusps”  or  junctions  that 
are  crucial. 


Theoretical  problems  with  the  model: 

• The notion of invariance has not uniformly withstood empirical examination (e.g.,
Hayward & Tarr, 1997), indicating that the internal representation of objects may not be as
simple as RBC holds.

•  Surface  characteristics  may  have  an  effect  on  extraction  of  a  geon-based  representation 
of  objects  (Hayward  &  Tarr,  1997). 

•  Tarr  and  Bulthoff  (1995)  argue  that  a  geon-based  structural  description  is  inadequate 
for  the  recognition  of  category-level  objects. 

•  (However,  good  agreement  with  field  testing  of  sub-category  object  recognition 
and  the  errors  made  in  such  recognition  seems  to  lend  support  to  the  generalizability  of  some 
aspects  of  RBC  theory  [O'Kane,  Biederman,  Cooper,  &  Nystrom,  1997].) 

• Because the aspect ratios of geons are not hypothesized in RBC theory to be included in the
internal representation of objects, some discriminations cannot be performed. For example, the
only distinction between a Boeing 747-400 and a 747-SP is that the latter is shorter. Both
objects are composed of the same components, however, so RBC-based recognition cannot
distinguish between them.

Modeling  target  acquisition  with  RBC  theory: 

•  Since  RBC  relies  on  the  extraction  of  object  primitives  that  are  based  on  a  line  drawing 
of  the  object  in  2-D  space,  models  that  use  a  DOG  or  DOOG  operation  to  extract  edges  from  an 
image  may  be  particularly  suitable.  Such  an  edge  extraction  technique  would  have  to  have  some 
way  of  eliminating  the  edge  artifacts  of  surrounding  clutter  and  shadows. 

• The fact that RBC ignores surface characteristics such as texture and color indicates
that in certain circumstances, it may be inapplicable.

•  A  key  difficulty  for  RBC  theory  as  a  general  purpose  object  recognition  explanation  is 
the  fact  that  it  can  distinguish  only  between  basic  categories  of  objects,  such  as  tanks  and  jeeps. 
Because  geons  do  not  have  extent,  internal  object  representations  may  not  be  able  to  distinguish 
between  members  of  the  category,  that  is,  identification  discrimination. 
Rand/Bailey's (1970) classical model of search


optical/objective  |--X---------------|  cognitive/subjective
reductive          |--X---------------|  comprehensive
target-centered    |--X---------------|  situation-centered
physiological      |---------------X--|  empirical
individual         |---------------X--|  ensemble

Primary  contributions  of  Bailey  models: 

•  Target  acquisition  is  considered  to  consist  of  three  distinct  steps:  time-dependent 
search,  time-independent  detection,  and  time-independent  discrimination.  Each  step  is 
considered  independent  of  the  others,  although  all  depend  on  the  same  information  in  the  scene 
(obviously),  and  each  subsequent  step  presupposes  that  the  previous  step  has  occurred.  The 
information  is  treated  separately,  though. 

•  The  independence  assumption  allows  the  probability  of  discrimination  to  be  the 
product  of  the  probabilities  of  each  stage  succeeding: 

\[ P = P_1 \times P_2 \times P_3 \]

in which P1 = probability of locating target in a single glimpse,

P2 = probability of detecting a located target, and

P3 = probability of discriminating a detected target

• P1 is a hard-shell search with a fixed glimpse aperture Ag.

• P2 is contrast-based, assumes SNR >> 1, and is based on observed target size and
contrast, though the absolute contrast of the targets specifically modeled (ground targets as seen
from the air) is rarely >1.

•  P3  is  based  loosely  on  the  Johnson  criteria  in  that  it  is  based  on  the  number  of 
resolvable  “cells”  across  the  smallest  target  dimension.  The  model  attempts  to  fit  the  asymptotic 
probability  of  discrimination  to  the  number  of  cycles  (i.e.,  the  TTPF)  with  an  inverse  exponential 
cut-off  at  0  probability  of  discrimination  when  cycles  <  2. 

Particulars 

•  Validated  against  Blackwell  data. 

•  Not  a  near-threshold  model.  Targets  were  detected,  based  on  contrast  rather  than  SNR. 

•  Location  stage  is  a  non-guided,  deliberate  search. 

•  Detection  stage  is  dependent  on  unconscious  visual  detection  of  contrast. 

•  Discrimination  stage  is  conscious,  effortful  process. 

•  Glimpse  rate  and  duration  are  constant  (0.3  second). 

•  Eye  movements  not  selected  at  random  or  completely  systematically: 

•  Distance  of  search  saccade  should  be  affected  by  Ag,  the  effective  glimpse 
aperture  over  which  foveal  search  can  occur.  Ag  is  influenced  by  size  of  known  target  with 
respect  to  FOV  size. 

•  Bailey  models  probability  of  glimpse  landing  on  target  as  function  of  glimpses 
(or  time)  as  1  minus  an  inverse  exponential,  that  is,  as  the  distribution  of  first  arrival  times  of  a 
Poisson  process. 
• Given sufficient time, the search portion of the model will eventually fall on target.


How  clutter  is  handled: 

• Clutter is modeled as a scene congestion parameter, G, which varies from 1 to 10 and
indicates the density of target-like scene elements. G’s primary influence in P1 is to reduce the
size of the glimpse aperture. That is, more clutter causes smaller glimpses and shorter saccades
(which is, in fact, the case).

\[ P_1(t) = 1 - \exp\left[-\frac{700}{G}\,\frac{a_T}{A_s}\,t\right] \]

in which a_T = target size,
A_s = search area,
G = scene congestion {1..10}, and
t = search time


•  Clutter  only  plays  a  role  in  location,  not  detection  or  discrimination. 
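
A minimal sketch of the Bailey framework follows, assuming the location probability takes the form P1(t) = 1 - exp[-(700/G)(a_T/A_s)t] and that the three stages combine as a simple product. The function names are hypothetical, and a_T and A_s are assumed to be expressed in the same units.

```python
import math

def p_locate(t, target_size, search_area, congestion):
    """P1: probability that a glimpse has landed on the target by time t."""
    rate = 700.0 * target_size / (congestion * search_area)
    return 1.0 - math.exp(-rate * t)

def p_acquire(p1, p2, p3):
    """Overall probability under the independence assumption: P = P1 * P2 * P3."""
    return p1 * p2 * p3
```

Because G appears only in the search rate, raising the congestion factor slows location but leaves P2 and P3 untouched, which is the limitation noted in the last bullet above.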

CRITIQUE: 

•  No  guidance  of  search  process  aside  from  knowledge  of  target  size  that  drives  saccade 
size  (Ag). 

• Search process cannot terminate on “target not found” decision.

•  Contrast  modeling  for  required  detection  in  P2  may  be  too  specific  for  general  use. 
VIDEM (Akerman & Kinzly, 1979)


optical/objective  |--X---------------|  cognitive/subjective
reductive          |--X---------------|  comprehensive
target-centered    |--X---------------|  situation-centered
physiological      |----------------X-|  empirical
individual         |---------------X--|  ensemble

Particulars: 

•  validated  by  Blackwell  data  (in  terms  of  contrast  threshold) 

•  Search  type:  soft  shell 

•  Background:  cluttered 

•  Targets 

• stationary, single targets, non-chromatic, equivalence to circles of a certain diameter

•  Search  location  selection:  random 

•  Bailey  (1970)  search  framework 

Detection  stage: 

•  target  contrast  is  modeled  to  be  that  of  a  disk  of  diameter  equivalent  to  the  target’s 
critical  dimension 

•  driven  by  contrast  threshold 

•  contrast  threshold  is  a  function  of  target  size  (equivalent  disk  diameter)  and 
retinal  eccentricity: 

CT  =  0.035290'24  +  0.584916/a2 

Discrimination  stage: 

•  driven  primarily  by  clutter  (see  below) 

Clutter  inclusion: 

•  clutter  increases  glimpse  duration,  decreases  distance  of  search  saccades,  increases  eye 
response  time,  and  increases  contrast  threshold 

•  clutter  is  assumed  to  be  a  GLOBAL  metric 

•  Mean  scene  clutter,  M-bar,  is  calculated  by  Waldman,  et  al.’s  (1988)  gray-level 
co-occurrency  metric,  which  bases  clutter  on  similarity  of  background  and  target  structure. 

•  Target  must  be  known  in  order  to  calculate  M-bar 

• instantiated in similar manner to Greening’s (1976) MARSAM model:

P3  =  [1  +  M/29tg0'93]"129 

in which M = number of confusable objects (from Waldman et al., 1988), and
tg = average glimpse time

• effect on tg:

\[ t_g = (0.5782 + M)\,\theta_s^{-0.2132} \]

in which $\theta_s$ is circular search field size (equivalent to saccade distance)

• effect on saccade distance:

\[ \theta_s = 0.152\, t_g^{-3.127} \]

• effect on contrast threshold (instantiated as a multiplier to contrast threshold):

\[ F_H = \exp(-5.46\, M^{2.37}) \]


CRITIQUE: 

VIDEM does a good job of representing the effects that clutter is known to have on search.
However, the numerous effects of clutter (and the number of parameters that must be fit to a
validation data set) yield a model in which it may be difficult to weigh the effect on any single
process. Also, treating targets as equivalent disks will be problematic for targets that are known
to have a high degree of anisotropy, that is, a length-to-width ratio vastly different from 1:1. VOM
attempts to address these two shortcomings.


Also: 

•  Random  saccade  locations  are  unrealistic,  given  that  Waldman’s  clutter  metric  has 
been  used  by  itself  to  predict  fixations  in  a  cluttered  scene.  That  is  to  say,  the  co-occurrence 
metric  used  to  calculate  a  local  clutter  metric  yields  regions  of  the  scene  that  are  highly  similar  to 
a  target.  As  such,  attentional  guidance  to  regions  of  similarity  to  the  known  target  (recall  that  the 
target  must  be  known  in  detail  to  calculate  M)  will  cause  saccades  to  known  target  locations. 

• The fact that VIDEM uses only a global clutter metric is the basis for its
assumption of random saccade location selection.
Visual Observer Model (VOM) - Akerman (1992, 1993b)


optical/objective  |--X---------------|  cognitive/subjective
reductive          |--X---------------|  comprehensive
target-centered    |--X---------------|  situation-centered
physiological      |----------------X-|  empirical
individual         |---------------X--|  ensemble

Particulars: 

• An extension of the VIDEM model (Akerman & Kinzly, 1979)

•  Excludes  some  effects  of  clutter 

•  Targets  may  be  represented  differently:  by  a  function  of  their  “useful  Area,  Au”  (rather 
than  an  equivalent  disk  area)  equal  to  a  portion  of  the  area  projected  inward  from  the  perimeter 
of  the  target 

Detection  stage: 

• Contrast threshold can now be calculated by the VOM criteria (excluding the clutter
multiplier, F_H) or by Nachman (1953) criteria:

\[ C_T = K_1\, p^{K_2} / A_u \]

in which K_1 and K_2 are constants, empirically derived, based on the adaptation luminance,
p = target perimeter, and

A_u = useful area measured inward from the perimeter (in angular distance).

Differences from VIDEM:

•  Clutter  is  not  assumed  to  have  an  effect  on  contrast  threshold  in  search  stage. 

•  The  notion  of  the  eye's  response  time  (an  additive  factor  to  glimpse  duration  that 
depends  on  clutter)  is  eliminated. 

CRITIQUE: 

The  same  criticisms  based  on  saccade  location  selection  still  hold  for  VOM  as  they  did  for 
VIDEM. 

The  modification  of  target  “size”  to  include  useful  area  is  potentially  quite  useful.  The  useful 
area  notion  introduces  more  observer-based  knowledge  to  the  target  acquisition  situation  since 
the  area  of  a  target  that  is  deemed  “useful”  will  depend  on  its  structure,  which  the  observer  is 
also  presumably  looking  for.  Given  that  the  gray-level  co-occurrency  matrix  upon  which  the 
clutter  metric  is  based  concerns  target  and  background  structure,  it  may  be  argued  that  a 
detection  stage  that  keys  onto  useful  area  is  also  incorporating  some  knowledge  of  structure  and 
thus  may  give  better  agreement  with  the  clutter  metric.  Eye  movements  may  therefore  be  better 
accounted  for,  albeit  not  directly. 
Appendix B. Proposed Metrics for Motion, Clutter, Conspicuity, and
Distinctness


This  appendix  contains  details  of  the  various  metrics  discussed  throughout  the  review. 

Number  of  confusing  forms  clutter  metric  (M)  -  Ryll  (1962) 


1  +  v0.29?0'93  J 

in which M = number of confusable forms visible within a glimpse and
t = glimpse duration.

NOTES: 

•  Global  metric. 

•  Only  affects  recognition. 

• Used in VIDEM and VOM models with M calculated by means of Waldman’s
co-occurrence clutter metric, Cn.

Scene  congestion  (G)  metric  (Bailey,  1970) 


\[ P_1(t) = 1 - \exp\left[-\frac{700}{G}\,\frac{a_T}{A_s}\,t\right] \]

in which a_T = target size,

A_s = search area,

G = scene congestion factor {1..10}, and
t = search time.

NOTES: 

• Clutter is modeled as a scene congestion parameter, G, which varies from 1 to 10 and
indicates the density of target-like scene elements. G’s primary influence in P1 is to reduce the
size of the glimpse aperture. That is, more clutter causes smaller glimpses and shorter saccades
(which is actually the case [e.g., Akerman, 1992]).

•  Clutter  only  plays  a  role  in  location,  not  detection  or  discrimination. 

•  Global  metric. 
Target conspicuity (Kp) (Williams, 1966)


\[ P_d = 1 - e^{-K_p t / A_d} \]

in which K_p = target conspicuity,
t = search time, and
A_d = display area.

NOTES: 

•  First  mention  of  how  conspicuity  can  be  instantiated  into  a  model. 

•  Author  realized  that  of  the  many  possible  factors  incorporated  in  conspicuity,  only 
luminance  contrast  was  well  defined  (at  the  time). 

•  Clutter  affects  detection  probability  (Pi)  only. 

•  Kp  empirically  determined. 
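
As a worked illustration of the conspicuity metric, and assuming the detection function has the exponential form P_d = 1 - exp(-K_p t / A_d), the search time needed to reach a given probability follows by inversion. The 50% case is shown as an example.

```latex
\[
P_d = 1 - e^{-K_p t / A_d}
\quad\Longrightarrow\quad
t = -\frac{A_d}{K_p}\,\ln\left(1 - P_d\right)
\]
\[
P_d = 0.5:\qquad t_{50} = \frac{A_d}{K_p}\,\ln 2 \approx 0.693\,\frac{A_d}{K_p}
\]
```

Under this reading, doubling the display area doubles the time needed to reach any fixed detection probability, while doubling the conspicuity halves it.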


Simple  First  Order  scene  metrics  (Pratt,  1991,  for  overview) 

•  Absolute  average  intensity  difference: 

•  RMS  intensity  and  target  variance  difference: 

•  Adjusted  RMS  intensity  and  target  variance  difference: 


I  I 

V(/fi  —  P B  )~  °Y 

V  (  Pt  ~  Pb  )  4crr 


• Absolute mean intensity difference plus absolute standard deviation difference: $|\mu_T - \mu_B| + |\sigma_T - \sigma_B|$

• Absolute mean intensity difference plus target standard deviation: $|\mu_T - \mu_B| + \sigma_T$

• The Doyle metric (Copeland, Trivedi, & McNamey, 1996): $\sqrt{(\mu_T - \mu_B)^2 + (\sigma_T - \sigma_B)^2}$

• The Doylemod metric (Copeland et al., 1996): $\sqrt{(\mu_T - \mu_B)^2 + k(\sigma_T - \sigma_B)^2}$

• The nrms metric (Kosnik, 1995): $\sigma_T / \mu_{T+B}$


NOTE: $\mu_T$ = mean of gray-level distribution over the target area

$\mu_B$ = mean of gray-level distribution over background support (typically area
immediately around target area)

$\sigma_T$ = standard deviation of gray-level distribution over target area

$\sigma_B$ = standard deviation of gray-level distribution over background support

k = modulation factor for variance difference
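
Several of the first-order metrics listed above can be computed directly from the gray levels of the target region and its background support. The sketch below is illustrative only: the function name and dictionary keys are hypothetical, the population standard deviation (NumPy's default) is an assumption, and only the metrics whose definitions are unambiguous here are included.

```python
import numpy as np

def first_order_metrics(target, background, k=1.0):
    """Compute simple first-order target/background metrics from pixel values."""
    mu_t, mu_b = float(np.mean(target)), float(np.mean(background))
    sd_t, sd_b = float(np.std(target)), float(np.std(background))
    d_mu, d_sd = mu_t - mu_b, sd_t - sd_b
    return {
        "abs_mean_diff": abs(d_mu),
        "mean_plus_sd_diff": abs(d_mu) + abs(d_sd),
        "mean_plus_target_sd": abs(d_mu) + sd_t,
        "doyle": float(np.sqrt(d_mu ** 2 + d_sd ** 2)),
        "doyle_mod": float(np.sqrt(d_mu ** 2 + k * d_sd ** 2)),
    }
```

Each metric depends only on the means and standard deviations of the two gray-level distributions, so shuffling the pixels within either region leaves every value unchanged. That is the sense in which first-order metrics carry no structural information.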


NOTES: 

•  First  order  metrics  or  any  combination  of  them  lack  structural  information  about  the 
target  or  background  support  and  thus  cannot  be  used  for  feature  extraction. 

• A more complex class of first order metrics is based on normalized histograms, described next.
The gray-level co-occurrency matrix


This matrix represents, within an area of a pixelated image, the frequency of one gray level
occurring in a specified linear spatial relationship with another gray level. The co-occurrency
matrix, $P_\Delta(i,j)$, is a G×G dimension matrix in which G is the number of gray scale levels in the
image. It is defined by

\[ P_\Delta(i,j) = \frac{1}{N}\sum_{k=1}^{N} f\left(x_k = i,\; x_{k+\Delta} = j\right) \]

in which $(x_k, x_{k+\Delta})$ = a pair of pixels with gray levels i and j,

i and j = gray-level values from 0 to a maximum, G, separated by

$\Delta$ = a displacement vector which is a function of the distance, $\delta$, between
the pixels and the angle $\theta$ between them,

f = {1 if $x_k = i$ and $x_{k+\Delta} = j$, or 0 otherwise}, and

N = number of pixels in the area of the image.
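
A direct, unoptimized sketch of this computation is given below. It is an illustration under stated assumptions: the displacement is given as a (row, column) offset rather than as a distance and angle, gray levels are integers from 0 to levels - 1, and the counts are normalized by the number of valid pixel pairs rather than by the number of pixels in the area. The function name is hypothetical.

```python
import numpy as np

def cooccurrence(image, levels, dr, dc):
    """Normalized gray-level co-occurrence matrix for displacement (dr, dc)."""
    img = np.asarray(image)
    counts = np.zeros((levels, levels), dtype=float)
    rows, cols = img.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            # Count the pair only if the displaced pixel lies inside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[img[r, c], img[r2, c2]] += 1
    n_pairs = counts.sum()
    return counts / n_pairs if n_pairs > 0 else counts
```

For a two-row image whose top row is all 0 and whose bottom row is all 1, a horizontal displacement of one pixel yields entries only on the diagonal, because every pixel's neighbor has the same gray level.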


Normalized  Clutter  metric  (Cn)  (Waldman,  Wooton,  Hobson,  &  Luetkemeyer,  1988) 

(Note:  the  gray-level  co-occurrency  matrix  PA(i,j)  is  detailed  in  the  text.) 

To  Calculate: 

• The amount of clutter C is calculated as the mean of the product of the relative texture
size and the distance-weighted transition probability (i.e., the probability of transitions between
gray levels in the co-occurrence matrix):

\[ C = \frac{s}{T}\, B(\Delta) \]

in which s = average texture element size,

T = average target size,

$\Delta$ = polar displacement (see text), and

$B(\Delta)$ = the distance-weighted transition probability, summed over gray levels i and j from 0.

• The normalized clutter is defined as either C/B_e or 1, whichever is smaller.

• B_e is the expected value of B.


NOTES: 

• Works for uniform textures only.

• Has overly simplistic mathematical properties:

    • It is symmetric with respect to target size and background texture size, and so ignores the search asymmetry literature.
Texture-based clutter (TIC) metric (Shirvaikar & Trivedi, 1992)

To  Calculate: 

• This metric is also based on the gray-level co-occurrence matrix. It is similar to normalized clutter, except that it puts quadratic instead of linear weight on differences in gray level.

• First, calculate the “inertia” of the co-occurrence matrix, $I(\Delta)$:

$$I(\Delta) = \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} (i - j)^2\, P_\Delta(i,j)$$

• The global TIC is calculated by dividing the inertia by the target size, A:

$$TIC = \frac{I(\Delta)}{A}$$
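The two steps above might be sketched as follows. The names `inertia` and `tic` are hypothetical, the co-occurrence matrix is assumed to be supplied as a square list of lists that sums to 1, and the target size is assumed to be a single positive number in whatever units the analyst has chosen.

```python
def inertia(P):
    """Quadratically weighted sum of co-occurrence probabilities."""
    g = len(P)
    return sum((i - j) ** 2 * P[i][j] for i in range(g) for j in range(g))


def tic(P, target_size):
    """Inertia divided by target size (assumed positive)."""
    if target_size <= 0:
        raise ValueError("target size must be positive")
    return inertia(P) / target_size


# Hand-made 3 x 3 co-occurrence matrix (entries sum to 1).
P = [
    [0.40, 0.10, 0.00],
    [0.10, 0.20, 0.05],
    [0.00, 0.05, 0.10],
]
print(inertia(P))    # off-diagonal mass 0.3, all at |i - j| = 1
print(tic(P, 10.0))
```

Because the weight is quadratic in the gray-level difference, probability mass far from the diagonal dominates the result, which is the stated contrast with the linearly weighted normalized clutter.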

NOTES: 

• Global metric.

• Depends on target size.

• Performs marginally better than SV.

• The authors recognize that the metric alone, because it fails to capture internal target structure, may not capture perceptually meaningful information and should be used in addition to measures that do capture such structure (Shirvaikar & Trivedi, 1992).


Average  Co-occurrence  Error  (ACE)  metric  (Copeland  &  Trivedi,  1996, 1998) 


To  Calculate: 

• Define a target region and a background region.

• Define the “texture model” as the separation, in pixels, between the two pixels that are compared within each region (typically eight pixels).

• ACE is the absolute difference between corresponding elements of the target and background co-occurrence matrices, summed over all possible displacement vectors of the length specified within the texture model and averaged over those vectors (see Copeland & Trivedi, 1996, for more details):


$$ACE = \frac{1}{\Theta_{NGLC}} \sum_{\Delta \in \Theta} \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} \left| P_T(i,j \mid \Delta) - P_B(i,j \mid \Delta) \right|$$

in which $\Theta_{NGLC}$ = total number of displacement vectors in the set $\Theta$ of vectors in the texture model,

G = number of gray-scale levels,

$P_T(i,j \mid \Delta)$ = joint probability of a pixel of gray level i and a pixel of gray level j, given the displacement vector $\Delta$, for the target pattern, and

$P_B(i,j \mid \Delta)$ = corresponding joint probability for the background pattern.


• For a separation of eight pixels, the set contains a total of 144 displacement vectors.

• To simplify calculation, the number of gray levels is typically reduced to eight, since computing the ACE with 144 operations on 256 × 256 matrices is very laborious.
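The ACE calculation can be sketched as below. Both function names are hypothetical. The enumeration in `displacement_set` is an assumption rather than something taken from Copeland and Trivedi: it counts every nonzero offset whose row and column components are at most the separation, treating a vector and its negative as the same displacement, which happens to give 144 vectors for a separation of eight. The co-occurrence matrices are assumed to be computed elsewhere, one per displacement.

```python
def displacement_set(max_sep):
    """Nonzero (dx, dy) offsets with components up to max_sep, one per +/- pair."""
    vectors = []
    for dy in range(0, max_sep + 1):
        for dx in range(-max_sep, max_sep + 1):
            if dy > 0 or dx > 0:  # keep one half-plane, drop (0, 0)
                vectors.append((dx, dy))
    return vectors


def ace(target_mats, background_mats):
    """Mean over displacements of the summed absolute matrix difference.

    target_mats, background_mats -- lists of G x G matrices in the same
    displacement order.
    """
    if not target_mats or len(target_mats) != len(background_mats):
        raise ValueError("need one target and one background matrix per displacement")
    total = 0.0
    for p_t, p_b in zip(target_mats, background_mats):
        for row_t, row_b in zip(p_t, p_b):
            for t, b in zip(row_t, row_b):
                total += abs(t - b)
    return total / len(target_mats)


print(len(displacement_set(8)))  # 144

# Two displacements, two gray levels: the matrices differ for the first only.
target = [[[0.5, 0.5], [0.0, 0.0]], [[0.25, 0.25], [0.25, 0.25]]]
background = [[[0.25, 0.25], [0.25, 0.25]], [[0.25, 0.25], [0.25, 0.25]]]
print(ace(target, background))   # 0.5
```

With eight gray levels each matrix has only 64 entries, so the full sum over 144 displacements stays small, which is the practical reason given above for reducing the number of gray levels.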

NOTES: 

•  Authors  used  this  metric  to  predict  human  judgments  of  texture  differences. 

•  Metric  outperformed  both  Doyle  metric  and  a  model  based  on  boundary  strength 
(Muller,  1986). 

•  Local  clutter  metric. 


Circular Symmetry ($CS_8$) clutter metric (Reisfeld, Wolfson, & Yeshurun, 1995)


To calculate:

• For each pixel P, calculate a set of values based on the local gradient in the area and the symmetry of the point in eight radial directions about points in the area: $S_8(i, P)$, where i is the direction of symmetry.

• The symmetry, $CS_8(P)$, for each point is the product of the $S_8$ terms for all eight directions:

$$CS_8(P) = \prod_{i=1}^{8} \left[ 1 + S_8(i, P) \right]$$

• A global symmetry metric is calculated thus:

    • Divide the scene into N rectangular blocks, indexed by k.

    • Calculate the sum of symmetry values within each block:

$$CS_{8,k} = \sum_{P \in k} CS_8(P)$$

    • The global metric is the root mean square of the block-wise symmetries:

$$CS = \left( \frac{1}{N} \sum_{k=1}^{N} CS_{8,k}^2 \right)^{1/2}$$
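The pooling steps above (product over directions, sum within blocks, root mean square over blocks) might be sketched as follows. The per-direction symmetry values themselves are not computed here; they are assumed to be available for every pixel, and the function names and the input layout are hypothetical.

```python
import math


def cs8(s8_values):
    """Product over the eight directions of 1 + S_8(i, P) for one pixel."""
    if len(s8_values) != 8:
        raise ValueError("expected one symmetry value per direction (8 in total)")
    product = 1.0
    for s in s8_values:
        product *= 1.0 + s
    return product


def global_symmetry(blocks):
    """Root mean square of block-wise sums of per-pixel symmetry.

    blocks -- list of blocks; each block is a list of pixels; each pixel is
              a sequence of eight S_8 values (assumed precomputed).
    """
    if not blocks:
        raise ValueError("need at least one block")
    block_sums = [sum(cs8(pixel) for pixel in block) for block in blocks]
    return math.sqrt(sum(b * b for b in block_sums) / len(block_sums))


flat = [0.0] * 8                 # no symmetry in any direction: CS_8 = 1
one_axis = [1.0] + [0.0] * 7     # symmetry in one direction only: CS_8 = 2
blocks = [[flat, flat], [one_axis, flat]]   # block sums are 2 and 3
print(global_symmetry(blocks))   # sqrt((4 + 9) / 2)
```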


NOTES: 


• The assumptions behind the metric are that (1) man-made objects are more likely than natural scene elements to have a high degree of symmetry, and (2) the visual system is able to readily locate regions of high local symmetry in a scene.


Statistical  Variance  (SV)  clutter  metric  and  the  SCR  (Schmieder  &  Weathersby,  1983) 


To  calculate: 

•  Divide  scene  into  N  blocks,  each  twice  the  height  and  width  of  a  known  target. 

•  Calculate  gray-level  variance  of  pixels  within  each  block  i. 

•  SV  is  the  root  mean  square  of  the  block  variance: 

$$SV = \left( \frac{1}{N} \sum_{i=1}^{N} \sigma_i^2 \right)^{1/2}$$


We calculate SCR by dividing the contrast of the target with its immediate background by SV:

$$SCR = \frac{\left| \text{maximum target value} - \text{maximum background value} \right|}{SV}$$
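The SV and SCR calculations can be sketched as follows. The function names are hypothetical, the image is assumed to be a list of rows of gray levels, blocks are non-overlapping, partial blocks at the right and bottom edges are dropped, and the population variance is used within each block; none of these choices is specified in the description above.

```python
import math


def block_variances(image, block_h, block_w):
    """Gray-level variance of each non-overlapping block (partial blocks dropped)."""
    rows, cols = len(image), len(image[0])
    variances = []
    for r0 in range(0, rows - block_h + 1, block_h):
        for c0 in range(0, cols - block_w + 1, block_w):
            pixels = [image[r][c]
                      for r in range(r0, r0 + block_h)
                      for c in range(c0, c0 + block_w)]
            mean = sum(pixels) / len(pixels)
            variances.append(sum((p - mean) ** 2 for p in pixels) / len(pixels))
    return variances


def statistical_variance(image, target_h, target_w):
    """SV: root of the mean block variance, blocks twice the target size."""
    variances = block_variances(image, 2 * target_h, 2 * target_w)
    if not variances:
        raise ValueError("image is smaller than one block")
    return math.sqrt(sum(variances) / len(variances))


def scr(max_target, max_background, sv):
    """Signal-to-clutter ratio as written above: |difference| / SV."""
    return abs(max_target - max_background) / sv


# 4 x 4 scene and a 1 x 1 target, so blocks are 2 x 2.
image = [
    [0, 2, 5, 5],
    [2, 0, 5, 5],
    [1, 1, 0, 4],
    [1, 1, 4, 0],
]
sv = statistical_variance(image, target_h=1, target_w=1)
print(sv)              # block variances 1, 0, 0, 4 give sqrt(1.25)
print(scr(9, 4, sv))
```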


NOTES: 

•  Based  on  idea  that  the  visual  system  is  interested  in  areas  of  the  scene  with  high  gray- 
level  variability.  Consistent  with  the  notion  of  “confusing  forms”  in  that  targets  are  presumed  to 
have  high  gray-level  variability,  although  SV  does  not  take  into  account  actual  target  structure. 
(Instead,  it  uses  the  variance  of  targets  as  a  generalization  of  target-like  structure.) 

• SV is a global measure, though $\sigma_i^2$ represents a local metric for clutter.

• This metric underestimated performance for urban clutter, indicating that variance alone does not completely instantiate clutter (Cathcart, Doll, & Schmieder, 1989).


Probability-of-Edge  (POE)  clutter  metric  (Tidhar  et  al.,  1994) 


To  calculate: 

•  Convert  gray-scale  image  into  edge  map. 

• The image is divided into regions. Each region i is assigned a value, $POE_i$, equal to the fraction of pixels within it that are edges.

• The overall probability of edge for an image is the rms of the local $POE_i$ values:

$$POE = \left( \frac{1}{N} \sum_{i=1}^{N} POE_{i,T}^2 \right)^{1/2}$$

in which $POE_{i,T}$ = probability of edge in region i with DOOG filter threshold T, and N = number of regions
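A minimal sketch of the POE calculation follows. It does not implement a DOOG filter: a thresholded forward-difference gradient magnitude is swapped in purely to produce a binary edge map for illustration. The function names, the rectangular non-overlapping regions, and the dropping of partial regions at the edges are likewise assumptions.

```python
import math


def edge_map(image, threshold):
    """Binary edge map from forward differences (a stand-in for the DOOG filter)."""
    rows, cols = len(image), len(image[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            gx = image[r][c + 1] - image[r][c] if c + 1 < cols else 0
            gy = image[r + 1][c] - image[r][c] if r + 1 < rows else 0
            if math.hypot(gx, gy) > threshold:
                edges[r][c] = 1
    return edges


def probability_of_edge(edges, block_h, block_w):
    """Return (overall rms POE, list of local edge fractions per region)."""
    rows, cols = len(edges), len(edges[0])
    local = []
    for r0 in range(0, rows - block_h + 1, block_h):
        for c0 in range(0, cols - block_w + 1, block_w):
            n_edges = sum(edges[r][c]
                          for r in range(r0, r0 + block_h)
                          for c in range(c0, c0 + block_w))
            local.append(n_edges / (block_h * block_w))
    if not local:
        raise ValueError("edge map is smaller than one region")
    overall = math.sqrt(sum(p * p for p in local) / len(local))
    return overall, local


# A vertical step edge between the second and third columns.
image = [[0, 0, 9, 9] for _ in range(4)]
edges = edge_map(image, threshold=4)
overall, local = probability_of_edge(edges, 2, 2)
print(local)    # [0.5, 0.0, 0.5, 0.0]
print(overall)  # sqrt(0.125)
```

Returning both values mirrors the local and global uses of the metric: the list gives one value per region, and the rms pools them into a single scene-level number.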


NOTES: 

• Based on the idea that early visual processing is involved in edge detection and extraction (e.g., Marr & Hildreth, 1980).

• Edge detection is performed with a DOOG whose output over the scene is thresholded to a level T to yield a yes/no pixel-by-pixel edge map of the scene.

• Local or global metric of clutter, depending on whether $POE_i$ or POE is of interest.


Peak-Signal ($\Delta T_{PS}$) clutter metric (Rotman, Kowalczyk, & George, 1994)

To calculate:

• Set a tolerance $\Delta T$ and a minimum cluster size.

• Start with a pixel at a corner and compare it to its neighbor. If the intensity difference is within $\Delta T$, average the two intensities and join the pixels into a cluster.

• If the difference is greater than $\Delta T$, then the new pixel is assigned to a new cluster.

• Repeat this for all pixels, then for all existing clusters until clusters are at least as large as the minimum cluster size.

• The peak-signal $\Delta T_{PS}$ is calculated as:

$$\Delta T_{PS} = \frac{T_{max} - T_{min}}{1 + \dfrac{A(T_{max}) - A(T_{min})}{A(T_{max}) + A(T_{min})}}$$

in which $T_{max}$ and $T_{min}$ = intensities of the most and least intense clusters and
$A(T_{max})$ and $A(T_{min})$ = areas of the most and least intense clusters.

• We may calculate block-wise $\Delta T_{PS,i}$ by computing $\Delta T_{PS}$ for arbitrary blocks of the scene.

    • This step may be useful for eye movement validation of the metric, but otherwise, it runs the risk of cutting clusters down the middle.

NOTES:

• Based on the contrast between local extrema and their background.

• Divides the scene into clusters by grouping pixels of the image together into regions of high and low intensity (the T in the metric is short for temperature) based on the contrast of each pixel with its neighbor. Groups of pixels are likewise grouped together with their neighbors until clusters of the minimum size defined by the user are achieved.

• A global metric for clutter.


Target  Complexity  (TC)  metric  (Tidhar  et  al.,  1994) 

To  Calculate: 

• Calculate a histogram of edge intensities by means of a DOOG filter over the target area and its immediate surround. Let the histogram be:

$$\{N_i\}, \quad i = 0, 1, \ldots, G-1$$

in which 0 and G-1 are the minimum and maximum histogram values and
$N_i$ = the number of pixels in the bin at level i.

• The corresponding cumulative distribution of edge intensity levels is:

$$S_N(i) = \frac{1}{M} \sum_{j=0}^{i} N_j$$

in which M = the number of points in the histogram.

• $S_N$ has the following properties:

$S_N(i) = 0$ when $i < 0$
$S_N(i) = 1$ when $i > G-1$
$S_N(i) \le S_N(i+1)$

• Target detectability is proportional to the absolute mean distance between the cumulative edge histogram of the observed target section ($S_N$) and that of the situation when all pixels have the same value ($P(i)$):

$$TC = \frac{1}{G} \sum_{i=0}^{G-1} \left| S_N(i) - P(i) \right|$$

in which $P(i) = \frac{i}{G}$ (uniform distribution)
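The cumulative edge histogram used in the steps above might be computed as sketched here. Only the cumulative distribution is shown, not the final TC value. The function name is hypothetical, and the edge intensities are assumed to have been quantized already to integer levels from 0 to G-1 by some edge filter that is not part of the sketch.

```python
def cumulative_histogram(edge_levels, levels):
    """Cumulative distribution of quantized edge intensities.

    edge_levels -- flat list of integer edge-intensity levels in 0..levels-1
    levels      -- number of histogram bins (G)
    """
    if not edge_levels:
        raise ValueError("need at least one edge-intensity value")
    counts = [0] * levels
    for v in edge_levels:
        counts[v] += 1
    m = len(edge_levels)  # number of points in the histogram
    cumulative = []
    running = 0
    for n in counts:
        running += n
        cumulative.append(running / m)
    return cumulative


s_n = cumulative_histogram([0, 0, 1, 3], levels=4)
print(s_n)  # [0.5, 0.75, 0.75, 1.0]
```

The result is non-decreasing and reaches 1 at the top level, in line with the properties listed for the cumulative distribution.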

NOTES: 

•  local  metric 

•  Reasonable  correlation  to  overall  search  RT  (authors). 

•  The  size  of  the  surroundings  taken  with  the  target  seems  to  be  crucial,  as  a  uniform 
local  surround  yields  a  prediction  of  zero  clutter  even  when  the  overall  scene  may  be  very 
complex. 


Complex  Clutter  metric  (K)  (Lillesaeter,  1993) 


To  Calculate: 

•  An  image  with  a  visible  target-background  border  must  be  selected. 

• Let Z be the entire length of the target contour.

$$K = a \left| \ln G_T - \ln G_B \right| + \frac{b}{Z} \oint_Z \left| \ln \frac{G_T}{G_B} \right| dz$$

in which $G_T$ = pixel gray value distribution of the target area,

$G_B$ = pixel gray value distribution of the background support area, and

a, b = weight factors that sum to unity (usually assumed to be 0.5).


NOTES: 

•  Incorporates  variable  target-background  contrast  around  border  with  a  first  order 
metric  of  contrast. 

•  The  first  term  is  the  mean  area  contrast. 

•  The  second  term  corresponds  to  the  contrast  around  the  entire  target-background 
boundary. 

•  The  amount  of  the  background  to  incorporate  into  the  contrast  calculation  is  arbitrary. 

•  Does  not  take  into  account  structure  of  target  or  length  of  perimeter  (which,  in  extreme 
circumstances,  may  inflate  the  metric). 

•  Local  metric 


Normalized Histogram Intersection and CAMAELEON camouflage strength (C) (Hecker, 1992)

Normalized  gray-level  histogram  calculation  (for  n-bit  gray  level): 

• Let h(v) denote the histogram entry for value v, and let the image represent a function with $2^n$ levels on a rectangular array of width w and height h:

$$\sum_{v} h(v) = w \times h$$

with $v \in [0, 2^n - 1]$


• If $n_a$ is the number of pixels in the area over which the histogram is computed, then the normalized histogram H(v) is defined as:

$$H(v) = \frac{h(v)}{n_a}$$


(Note:  The  area  over  which  the  histogram  is  calculated  does  not  need  to  be  rectangular. 
However,  it  is  assumed  to  be  a  rectangular  region  around  the  target  for  the  calculation  of  this  and 
most  other  metrics  used  in  target  acquisition  models.) 


Histogram  Intersection  calculation: 

• Let $H_T$ be the normalized histogram containing the target and $H_B$ the normalized background histogram.

• The intersection of the histograms is defined as the cumulative sum of the pairwise minimum of corresponding histogram bin heights:

$$H_T \cap H_B = \sum_{v=1}^{n} \min\{H_T(v), H_B(v)\}$$

in which n = number of bins


• The value of the intersection will be between 0 and 1: 0 = no overlap, 1 = complete overlap.
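The normalized histogram and its intersection can be sketched as follows. The function names are hypothetical, pixels are assumed to be integer gray levels for an n-bit image, and each region is passed as a flat list so that it need not be rectangular.

```python
def normalized_histogram(pixels, n_bits):
    """Histogram of gray levels divided by the number of pixels in the area."""
    if not pixels:
        raise ValueError("need at least one pixel")
    counts = [0] * (2 ** n_bits)
    for v in pixels:
        counts[v] += 1
    n_a = len(pixels)
    return [c / n_a for c in counts]


def histogram_intersection(h_t, h_b):
    """Sum of the bin-wise minimum of two normalized histograms."""
    if len(h_t) != len(h_b):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(t, b) for t, b in zip(h_t, h_b))


# 2-bit example: the regions differ in size, which normalization allows.
target = [0, 0, 1, 3]
background = [0, 1, 1, 2, 3, 3, 3, 3]
h_t = normalized_histogram(target, n_bits=2)      # [0.5, 0.25, 0.0, 0.25]
h_b = normalized_histogram(background, n_bits=2)  # [0.125, 0.25, 0.125, 0.5]
print(histogram_intersection(h_t, h_b))           # 0.625
```

Because both histograms sum to 1, the intersection is 1 only when they are identical and 0 only when no bin is occupied in both regions.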

Camaeleon calculation:

• Start with a gray-level or color input image.

• Specify a region in the image as target and a region as background (need not be the same size).

• Camaeleon convolves the image with a set of quadrature band-pass filters to derive pixel-by-pixel representations for the target and background regions:

    • Local energy is based on the sum over all bands of the energies of individual filters.

    • Local spatial frequency is based on the vector sum of complex frequency averaged over all filter bands.

    • Local orientation is computed as the vector sum of directions over all filter bands.

• Normalized histograms are then calculated for target and background pixels in energy ($HE_T$ and $HE_B$), frequency ($HF_T$ and $HF_B$), and orientation ($HO_T$ and $HO_B$).

• Camouflage strength, C, is calculated as the product of the histogram intersections:

$$C = (HE_T \cap HE_B) \cdot (HO_T \cap HO_B) \cdot (HF_T \cap HF_B)$$

NOTES: 

•  C  is  inverse  of  conspicuity  (C  may  be  thought  of  as  measure  of  local  clutter)  but  is  only 
defined  on  [0,1].  The  rank  order  of  targets  with  different  values  of  C  will  reflect  the  rank  order 
of  their  clutter. 

•  Assumption  is  made  that  orientation,  frequency,  and  energy  are  all  equally  important  to 
estimates  of  camouflage. 

•  Boundaries  between  target  and  its  background  are  not  necessarily  taken  into  account. 

•  Being  based  on  first  order  metrics  (i.e.,  histograms)  of  individual  features,  the  spatial 
configuration  of  the  features  is  not  specified  in  the  metric. 
Cortex Transform-based distinctness (d) (Watson, 1987; Ahumada & Beard, 1996; Rohaly, Ahumada, & Watson, 1997)


To  Calculate: 

• Take the image with target, $I_1$, and the image without target, $I_0$.

• Convert the images to luminance contrast by subtracting and then dividing by the mean background image luminance, $\bar{I}_0$:

$$I_j \leftarrow (I_j - \bar{I}_0) / \bar{I}_0$$

• A contrast sensitivity filter S is then applied to $I_j$ by multiplying its Fourier components by the magnitude of S at the corresponding frequencies and then recombining the components with the inverse Fourier transform:

$$I_j \leftarrow F^{-1}[S \cdot F[I_j]]$$


• Next, the Cortex transform is applied to the image. The Cortex transform corresponds to a set of 20 filters: five spatial frequencies with four orientations each, applied to every point (x,y) in the image. The resulting coefficients, corresponding to the signal strength of the channel, for image $I_j$ are $C_{j,k}$, where k ranges over four dimensions: orientation, frequency, x, and y.


• The detectability of each coefficient ($d_k$) is computed by taking the absolute difference between image and background coefficients:

$$d_k = |C_{1,k} - C_{0,k}|$$


• Masking is implemented for super-threshold channels by decreasing $d_k$ by a factor related to the background channel signal when the background channel exceeds detection threshold:

$$d_k = \frac{|C_{1,k} - C_{0,k}|}{\max(1, |C_{0,k}|)}$$

• The overall distinctness metric, d, is calculated as the Minkowski sum of the individual coefficients:

$$d = \left( \sum_{k} d_k^{\beta} \right)^{1/\beta}$$

in which $\beta$ = the Minkowski summation exponent.


NOTES:


• Based on psychophysics and physiology.

    • Contrast sensitivity function and spatial frequency (SF) decomposition of the image are part of the calculation.

• Comparison between two images (one with target and one without) to determine detectability of the target.

• Incorporates masking (the metric was determined to over-predict performance without it [Rohaly et al., 1997]).

• Only for achromatic images.
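The last two calculation steps, masking and Minkowski pooling, might be sketched as below. The contrast conversion, contrast sensitivity filtering, and Cortex transform are not implemented; the channel coefficients are assumed to be supplied as flat lists already expressed in units of detection threshold. The function name, the masking divisor max(1, |c0|), and the default exponent `beta=4.0` are assumptions for illustration, not values confirmed by the sources cited above.

```python
def distinctness(c1, c0, beta=4.0):
    """Pool masked coefficient differences with a Minkowski sum.

    c1   -- channel coefficients for the image with the target
    c0   -- channel coefficients for the image without the target
    beta -- Minkowski exponent (hypothetical default)
    """
    if len(c1) != len(c0) or not c1:
        raise ValueError("need matching, non-empty coefficient lists")
    total = 0.0
    for a, b in zip(c1, c0):
        # Masking: the difference is reduced only when the background
        # coefficient exceeds threshold (magnitude above 1).
        d_k = abs(a - b) / max(1.0, abs(b))
        total += d_k ** beta
    return total ** (1.0 / beta)


with_target = [3.0, 0.5, 4.0]
without_target = [1.0, 0.5, 2.0]
# Masked differences are 2, 0, and 1; with beta = 2 the pooled value is sqrt(5).
print(distinctness(with_target, without_target, beta=2.0))
```

A larger exponent makes the pooled value approach the largest single masked difference, while an exponent of 1 reduces it to a plain sum.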